The Annals of Statistics 

2007, Vol. 35, No. 1, 192-223 

DOI: 10.1214/009053606000001172 

© Institute of Mathematical Statistics. 2007 



CONVERGENCE RATES OF POSTERIOR DISTRIBUTIONS FOR 
NONIID OBSERVATIONS 

By Subhashis Ghosal 1 and Aad van der Vaart 

North Carolina State University and Vrije Universiteit Amsterdam 

We consider the asymptotic behavior of posterior distributions 
and Bayes estimators based on observations which are required to 
be neither independent nor identically distributed. We give general 
results on the rate of convergence of the posterior measure rela- 
tive to distances derived from a testing criterion. We then special- 
ize our results to independent, nonidentically distributed observa- 
tions, Markov processes, stationary Gaussian time series and the 
white noise model. We apply our general results to several examples 
of infinite-dimensional statistical models including nonparametric re- 
gression with normal errors, binary regression, Poisson regression, an 
interval censoring model, Whittle estimation of the spectral density 
of a time series and a nonlinear autoregressive model. 

1. Introduction. Let $(\mathfrak{X}^{(n)}, \mathcal{A}^{(n)}, P_\theta^{(n)} : \theta \in \Theta)$ be a sequence of statistical
experiments with observations $X^{(n)}$, where the parameter set $\Theta$ is arbitrary
and $n$ is an indexing parameter, usually the sample size. We put a prior
distribution $\Pi_n$ on $\theta \in \Theta$ and study the rate of convergence of the posterior
distribution $\Pi_n(\cdot \mid X^{(n)})$ under $P_{\theta_0}^{(n)}$, where $\theta_0$ is the "true value" of the
parameter. The rate of this convergence can be measured by the size of the
smallest shrinking balls around $\theta_0$ that contain most of the posterior
probability. For parametric models with independent and identically distributed
(i.i.d.) observations, it is well known that the posterior distribution converges
at the rate $n^{-1/2}$. When $\Theta$ is infinite-dimensional, but the observations are
i.i.d., Ghosal, Ghosh and van der Vaart [14] obtained rates of convergence
in terms of the size of the model (measured by the metric entropy or
existence of certain tests) and the concentration rate of the prior around $\theta_0$ and
computed the rate of convergence for a variety of examples. A similar result
was obtained by Shen and Wasserman [27] under stronger conditions.

Little is known about the asymptotic behavior of the posterior distribu- 
tion in infinite-dimensional models when the observations are not i.i.d. For 
independent, nonidentically distributed (i.n.i.d.) observations, consistency 
has recently been addressed by Amewou-Atisso, Ghosal, Ghosh and Ra- 
mamoorthi [1] and Choudhuri, Ghosal and Roy [7]. The main purpose of 
the present paper is to obtain a theorem on rates of convergence of posterior 
distributions in a general framework not restricted to the setup of i.i.d. ob- 
servations. We specialize this theorem to several classes of non-i.i.d. models 
including i.n.i.d. observations, Gaussian time series, Markov processes and 
the white noise model. The theorem applies in every situation where it is 
possible to test the true parameter versus balls of alternatives with expo- 
nential error probabilities and it is not restricted to any particular structure 
on the joint distribution. The existence of such tests has been proven in 
many special cases by Le Cam [20, 21, 22] and Birgé [3, 4, 5], who used
them to construct estimators with optimal rates of convergence, determined 
by the (local) metric entropy or "Le Cam dimension" of the model. Our 
main theorem uses the same metric entropy measure of the complexity of 
the model and combines this with a measure of prior concentration around 
the true parameter to obtain a bound on the posterior rate of convergence, 
generalizing the corresponding result of Ghosal, Ghosh and van der Vaart 
[14]. We apply these results to obtain posterior convergence rates for linear 
regression, nonparametric regression, binary regression, Poisson regression, 
interval censoring, spectral density estimation and nonlinear autoregression.
Van der Meulen, van der Vaart and van Zanten [30] have extended the
approach of this paper to several types of diffusion models.

The organization of the paper is as follows. In the next section, we de- 
scribe our main theorem in an abstract framework. In Sections 3, 4, 5 and 6, 
we specialize to i.n.i.d. observations, Markov chains, the white noise model 
and Gaussian time series, respectively. In Section 7, we discuss a large num- 
ber of more concrete applications, combining models of various types with 
many types of different priors, including priors based on the Dirichlet pro- 
cess, mixture representations or sequence expansions on spline bases, priors 
supported on finite sieves and conjugate Gaussian priors. Technical proofs, 
including the proofs of the main results, are collected in Section 8. 

The notation $\lesssim$ will be used to denote inequality up to a constant that
is fixed throughout. The notation $Pf$ will abbreviate $\int f\,dP$. The symbol
$\lfloor x \rfloor$ will stand for the greatest integer less than or equal to $x$. Let
$h(f,g) = \bigl(\int (f^{1/2} - g^{1/2})^2\,d\mu\bigr)^{1/2}$ and $K(f,g) = \int f \log(f/g)\,d\mu$ stand for the
Hellinger distance and Kullback-Leibler divergence, respectively, between
two nonnegative densities $f$ and $g$ relative to a measure $\mu$. Furthermore, we
define additional discrepancy measures by $V_k(f,g) = \int f\,|\log(f/g)|^k\,d\mu$ and
$V_{k,0}(f,g) = \int f\,|\log(f/g) - K(f,g)|^k\,d\mu$, $k > 1$. The index $k = 2$ of $V_2$ and
$V_{2,0}$ may be omitted and these simply written as $V$ and $V_0$, respectively.
The symbols $\mathbb{N}$ and $\mathbb{R}$ will denote the sets of natural and real numbers,
respectively. The $\varepsilon$-covering number of a set $\Theta$ for a semimetric $d$, denoted
by $N(\varepsilon, \Theta, d)$, is the minimal number of $d$-balls of radius $\varepsilon$ needed to cover
the set $\Theta$; see, for example, [31].
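The discrepancy measures just defined can be illustrated for densities on a finite set relative to counting measure. The following is a minimal sketch, not part of the paper: the function names and the two example densities are hypothetical, and it assumes that $g > 0$ wherever $f > 0$.

```python
import math

def hellinger(f, g):
    """Hellinger distance h(f, g) = (sum (sqrt f - sqrt g)^2)^(1/2)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(f, g)))

def kullback_leibler(f, g):
    """K(f, g) = sum f log(f/g); points with f = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(f, g) if a > 0)

def v_k(f, g, k=2):
    """V_k(f, g) = sum f |log(f/g)|^k."""
    return sum(a * abs(math.log(a / b)) ** k for a, b in zip(f, g) if a > 0)

def v_k0(f, g, k=2):
    """Centered version V_{k,0}(f, g) = sum f |log(f/g) - K(f, g)|^k."""
    kl = kullback_leibler(f, g)
    return sum(a * abs(math.log(a / b) - kl) ** k for a, b in zip(f, g) if a > 0)

# Hypothetical example densities on three points.
f = [0.5, 0.3, 0.2]
g = [0.4, 0.4, 0.2]
print(hellinger(f, g), kullback_leibler(f, g), v_k(f, g), v_k0(f, g))
```

For $k = 2$ the centered quantity is a variance, so $V_{2,0} = V_2 - K^2$, which gives a quick consistency check on any implementation.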

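2. General theorem. For each $n \in \mathbb{N}$ and $\theta \in \Theta$, let $P_\theta^{(n)}$ admit densities
$p_\theta^{(n)}$ relative to a $\sigma$-finite measure $\mu^{(n)}$. Assume that $(x, \theta) \mapsto p_\theta^{(n)}(x)$ is jointly
measurable relative to $\mathcal{A}^{(n)} \otimes \mathcal{B}$, where $\mathcal{B}$ is a $\sigma$-field on $\Theta$. By Bayes' theorem,
the posterior distribution is given by

$$\Pi_n(B \mid X^{(n)}) = \frac{\int_B p_\theta^{(n)}(X^{(n)})\,d\Pi_n(\theta)}{\int_\Theta p_\theta^{(n)}(X^{(n)})\,d\Pi_n(\theta)}, \qquad B \in \mathcal{B}. \tag{2.1}$$

Here, $X^{(n)}$ is an "observation," which, in our setup, will be understood to
be generated according to $P_{\theta_0}^{(n)}$ for some given $\theta_0 \in \Theta$.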
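When the prior is supported on finitely many parameter values, the integrals in Bayes' formula reduce to sums. The sketch below is illustrative only: the normal location model, the three candidate values and the data are assumptions, not taken from the paper. It works on the log scale to avoid underflow for large $n$.

```python
import math

def posterior(log_lik, prior):
    """Posterior weights on a finite parameter set from log-likelihoods and prior weights."""
    log_w = [ll + math.log(p) for ll, p in zip(log_lik, prior)]
    m = max(log_w)                      # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def normal_log_lik(theta, xs):
    """Log-likelihood of i.i.d. N(theta, 1) observations (an assumed toy model)."""
    return sum(-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi) for x in xs)

thetas = [-1.0, 0.0, 1.0]               # hypothetical finite parameter set
prior = [1 / 3, 1 / 3, 1 / 3]           # uniform prior weights
xs = [0.9, 1.2, 0.7, 1.1]               # hypothetical data
post = posterior([normal_log_lik(t, xs) for t in thetas], prior)
print(post)
```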

For each $n$, let $d_n$ and $e_n$ be semimetrics on $\Theta$ with the property that
there exist universal constants $\xi > 0$ and $K > 0$ such that for every $\varepsilon > 0$
and for each $\theta_1 \in \Theta$ with $d_n(\theta_1, \theta_0) > \varepsilon$, there exists a test $\phi_n$ such that

$$P_{\theta_0}^{(n)} \phi_n \le e^{-Kn\varepsilon^2}, \qquad \sup_{\theta \in \Theta : e_n(\theta, \theta_1) < \varepsilon\xi} P_\theta^{(n)}(1 - \phi_n) \le e^{-Kn\varepsilon^2}. \tag{2.2}$$

Typically, we have $d_n \le e_n$ and in many cases we choose $d_n = e_n$, but using
two semimetrics provides some added flexibility. Le Cam [20, 21, 22] and
Birgé [3, 4, 5] showed that the rate of convergence, in a minimax sense, of the
best estimators of $\theta$ relative to the distance $d_n$ can be understood in terms of
the Le Cam dimension or local entropy function of the set $\Theta$ relative to $d_n$.
For our purposes, this dimension is a function whose value at $\varepsilon > 0$ is defined
to be $\log N(\varepsilon\xi, \{\theta : d_n(\theta, \theta_0) \le \varepsilon\}, e_n)$, that is, the logarithm of the minimum
number of $e_n$-balls of radius $\varepsilon\xi$ needed to cover a $d_n$-ball of radius $\varepsilon$ around
the true parameter $\theta_0$. Birgé [3, 4] and Le Cam [20, 21, 22] showed that
there exist estimators $\hat\theta_n = \hat\theta_n(X^{(n)})$ such that $d_n(\hat\theta_n, \theta_0) = O_P(\varepsilon_n)$ under
$P_{\theta_0}^{(n)}$, where

$$\sup_{\varepsilon > \varepsilon_n} \log N(\varepsilon\xi, \{\theta : d_n(\theta, \theta_0) \le \varepsilon\}, e_n) \le n\varepsilon_n^2. \tag{2.3}$$

Further, under certain conditions $\varepsilon_n$ is the best rate obtainable, given the
model, and hence gives a minimax rate.

As in the i.i.d. case, the behavior of posterior distributions depends on
the size of the model measured by (2.3) and the concentration rate of the
prior $\Pi_n$ at $\theta_0$. For a given $k > 1$, let

$$B_n(\theta_0, \varepsilon; k) = \bigl\{\theta \in \Theta : K(p_{\theta_0}^{(n)}, p_\theta^{(n)}) \le n\varepsilon^2,\ V_{k,0}(p_{\theta_0}^{(n)}, p_\theta^{(n)}) \le n^{k/2}\varepsilon^k\bigr\}.$$

An appropriate condition will appear as a lower bound on $\Pi_n(B_n(\theta_0, \varepsilon; k))$,
with $k = 2$ being good enough to establish convergence in mean. For almost
sure convergence, or convergence of the posterior mean, better control may
be needed (through a larger value of $k$), depending on the rate of
convergence.

The following result, generalizing Theorem 2.4 of Ghosal, Ghosh and van 
der Vaart [14] for the i.i.d. case, bounds the rate of posterior convergence. 

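Theorem 1. Let $d_n$ and $e_n$ be semimetrics on $\Theta$ for which tests
satisfying (2.2) exist. Let $\varepsilon_n > 0$, $\varepsilon_n \to 0$, $(n\varepsilon_n^2)^{-1} = O(1)$, $k > 1$, and $\Theta_n \subset \Theta$ be
such that for every sufficiently large $j \in \mathbb{N}$,

$$\sup_{\varepsilon > \varepsilon_n} \log N\bigl(\tfrac12 \varepsilon\xi, \{\theta \in \Theta_n : d_n(\theta, \theta_0) < \varepsilon\}, e_n\bigr) \le n\varepsilon_n^2, \tag{2.4}$$

$$\frac{\Pi_n(\theta \in \Theta_n : j\varepsilon_n < d_n(\theta, \theta_0) \le 2j\varepsilon_n)}{\Pi_n(B_n(\theta_0, \varepsilon_n; k))} \le e^{Kn\varepsilon_n^2 j^2/2}. \tag{2.5}$$

Then for every $M_n \to \infty$, we have that

$$P_{\theta_0}^{(n)} \Pi_n(\theta \in \Theta_n : d_n(\theta, \theta_0) \ge M_n\varepsilon_n \mid X^{(n)}) \to 0. \tag{2.6}$$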

The theorem uses the fact that $\Theta_n \subset \Theta$ to alleviate the entropy condition
(2.4), but returns an assertion about the posterior distribution on $\Theta_n$ only.
The complementary assertion $P_{\theta_0}^{(n)} \Pi_n(\Theta \setminus \Theta_n \mid X^{(n)}) \to 0$ may be handled
either by a direct argument or by the following analog of Lemma 5 of [2].

Lemma 1. If $\dfrac{\Pi_n(\Theta \setminus \Theta_n)}{\Pi_n(B_n(\theta_0, \varepsilon_n; k))} = o(e^{-2n\varepsilon_n^2})$ for some $k > 1$, then
$P_{\theta_0}^{(n)} \Pi_n(\Theta \setminus \Theta_n \mid X^{(n)}) \to 0$.

The choice $\Theta_n = \Theta$, which makes the condition of Lemma 1 trivial,
imposes a much stronger restriction on (2.4) and is generally unattainable when
$\Theta$ is not compact.

The following theorem extends the convergence in Theorem 1 to almost 
sure convergence and yields a rate for the convergence under slightly stronger 
conditions. 

Theorem 2. In the situation of Theorem 1,

(i) if all $X^{(n)}$ are defined on a fixed sample space and $\varepsilon_n \gtrsim n^{-\alpha}$ for some
$\alpha \in (0, 1/2)$ such that $k(1 - 2\alpha) > 2$, then the convergence (2.6) also holds
in the almost sure sense;

(ii) if $\varepsilon_n \gtrsim n^{-\alpha}$ for some $\alpha \in (0, 1/2)$ such that $k(1 - 2\alpha) > 4\alpha$, then the
left side of (2.6) is $O(\varepsilon_n^2)$.

If $\Theta$ is a convex set and $d_n^2$ is a convex function in one argument keeping
the other fixed and is bounded above by $B$, then for $\hat\theta_n = \int \theta\,d\Pi_n(\theta \mid X^{(n)})$,
we have, by Jensen's inequality, that

$$d_n^2(\hat\theta_n, \theta_0) \le \int d_n^2(\theta, \theta_0)\,d\Pi_n(\theta \mid X^{(n)}) \le \varepsilon_n^2 + B^2\,\Pi_n(d_n(\theta, \theta_0) > \varepsilon_n \mid X^{(n)}).$$

This yields the rate $\varepsilon_n$ for the point estimator $\hat\theta_n$ under the conditions of
Theorem 1.

The complicated-looking condition (2.5) can often be simplified in
infinite-dimensional cases, where, typically, $n\varepsilon_n^2 \to \infty$. Because the numerator in
(2.5) is trivially bounded by one, a sufficient condition for (2.5) is that
$\Pi_n(B_n(\theta_0, \varepsilon_n; k)) \ge e^{-cn\varepsilon_n^2}$. The local entropy in condition (2.4) can also
often be replaced by the global entropy $\log N(\varepsilon\xi/2, \Theta_n, e_n)$ without affecting
rates. Also, if the prior is such that the minimax rate given by (2.3) satisfies
(2.5) and the condition of Lemma 1, then the posterior convergence rate
attains the minimax rate.

Entropy conditions, however, may not always be appropriate to ensure 
the existence of tests. Ad hoc tests may sometimes be more conveniently 
constructed. A more general theorem on convergence rates, which is formu- 
lated directly in terms of tests and stated below, may be proven in a similar 
manner. 

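Theorem 3. Let $d_n$ be a semimetric on $\Theta$, $\varepsilon_n \to 0$, $(n\varepsilon_n^2)^{-1} = O(1)$,
$k > 1$, $K > 0$, $\Theta_n \subset \Theta$ and $\phi_n$ be a sequence of test functions such that

$$P_{\theta_0}^{(n)} \phi_n \to 0, \qquad \sup_{\theta \in \Theta_n : j\varepsilon_n < d_n(\theta, \theta_0) \le 2j\varepsilon_n} P_\theta^{(n)}(1 - \phi_n) \lesssim e^{-Kj^2 n\varepsilon_n^2} \tag{2.7}$$

and (2.5) holds. Then for every $M_n \to \infty$, we have that
$P_{\theta_0}^{(n)} \Pi_n(\theta \in \Theta_n : d_n(\theta, \theta_0) \ge M_n\varepsilon_n \mid X^{(n)}) \to 0$.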

3. Independent observations. In this section, we consider the case where
the observation $X^{(n)}$ is a vector $X^{(n)} = (X_1, X_2, \ldots, X_n)$ of independent
observations $X_i$. Thus, we take the measures $P_\theta^{(n)}$ of Section 2 equal to
product measures $\bigotimes_{i=1}^n P_{\theta,i}$ on a product measurable space $\bigotimes_{i=1}^n (\mathfrak{X}_i, \mathcal{A}_i)$.
We assume that the distribution $P_{\theta,i}$ of the $i$th component $X_i$ possesses a
density $p_{\theta,i}$ relative to a $\sigma$-finite measure $\mu_i$ on $(\mathfrak{X}_i, \mathcal{A}_i)$, $i = 1, \ldots, n$. In this
case, tests can be constructed relative to the semimetric $d_n$, whose square
is given by

$$d_n^2(\theta, \theta') = \frac{1}{n} \sum_{i=1}^n \int \bigl(\sqrt{p_{\theta,i}} - \sqrt{p_{\theta',i}}\bigr)^2\,d\mu_i. \tag{3.1}$$



6 S. GHOSAL AND A. W. VAN DER VAART 

Thus, $d_n^2$ is the average of the squares of the Hellinger distances for the
distributions of the individual observations.
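As an illustration of the average squared Hellinger distance, consider independent Poisson observations, for which the squared Hellinger distance between two components has the closed form $2(1 - e^{-(\sqrt a - \sqrt b)^2/2})$ for means $a$ and $b$. The sketch below is an assumed example rather than part of the paper's argument; the function names and the mean vectors are hypothetical, and the series version is included only to cross-check the closed form.

```python
import math

def poisson_hellinger_sq(a, b):
    """Closed form of sum_x (sqrt(p_a(x)) - sqrt(p_b(x)))^2 for Poisson means a, b > 0."""
    return 2.0 * (1.0 - math.exp(-0.5 * (math.sqrt(a) - math.sqrt(b)) ** 2))

def poisson_hellinger_sq_series(a, b, terms=200):
    """Same quantity by direct summation over x = 0, ..., terms - 1."""
    total = 0.0
    for x in range(terms):
        log_pa = -a + x * math.log(a) - math.lgamma(x + 1)
        log_pb = -b + x * math.log(b) - math.lgamma(x + 1)
        total += (math.exp(0.5 * log_pa) - math.exp(0.5 * log_pb)) ** 2
    return total

def average_hellinger_sq(means, means_prime):
    """d_n^2: average of the componentwise squared Hellinger distances."""
    n = len(means)
    return sum(poisson_hellinger_sq(a, b) for a, b in zip(means, means_prime)) / n

# Hypothetical mean vectors for n = 4 independent Poisson observations.
print(average_hellinger_sq([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 5.0]))
```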

The following lemma, due to Birgé (cf. [22], page 491, or [4], Corollary 2
on page 149), guarantees the existence of tests satisfying the conditions of
(2.2).

Lemma 2. If $P_\theta^{(n)}$ are product measures and $d_n$ is defined by (3.1), then
there exist tests $\phi_n$ such that $P_{\theta_0}^{(n)} \phi_n \le e^{-\frac12 n d_n^2(\theta_0, \theta_1)}$ and $P_\theta^{(n)}(1 - \phi_n) \le
e^{-\frac12 n d_n^2(\theta_0, \theta_1)}$ for all $\theta \in \Theta$ such that $d_n(\theta, \theta_1) \le \frac{1}{18} d_n(\theta_0, \theta_1)$.

The Kullback-Leibler divergence between product measures is equal to
the sum of the Kullback-Leibler divergences between the individual
components. Furthermore, as a consequence of the Marcinkiewicz-Zygmund
inequality (e.g., [9], page 356), the mean $\bar Y_n$ of $n$ independent random variables
satisfies $E|\bar Y_n - E\bar Y_n|^k \le C_k n^{-k/2}\,\frac1n \sum_{i=1}^n E|Y_i|^k$ for $k \ge 2$, where $C_k$ is a
constant depending only on $k$. Therefore, the set $B_n(\theta_0, \varepsilon; k)$ contains the set

$$B_n^*(\theta_0, \varepsilon; k) = \Bigl\{\theta \in \Theta : \frac1n \sum_{i=1}^n K_i(\theta_0, \theta) \le \varepsilon^2,\ \frac1n \sum_{i=1}^n V_{k,0;i}(\theta_0, \theta) \le C_k \varepsilon^k\Bigr\},$$

where $K_i(\theta_0, \theta) = K(P_{\theta_0,i}, P_{\theta,i})$ and $V_{k,0;i}(\theta_0, \theta) = V_{k,0}(P_{\theta_0,i}, P_{\theta,i})$. Thus, we
can work with a "ball" around $\theta_0$ relative to the average Kullback-Leibler
divergence and the average $k$th order moments, as in the preceding display,
and simplify Theorem 1 to the following result:

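Theorem 4. Let $P_\theta^{(n)}$ be product measures and $d_n$ be defined by (3.1).
Suppose that for a sequence $\varepsilon_n \to 0$ such that $n\varepsilon_n^2$ is bounded away from zero,
some $k > 1$, all sufficiently large $j$ and sets $\Theta_n \subset \Theta$, the following conditions
hold:

$$\sup_{\varepsilon > \varepsilon_n} \log N(\varepsilon/36, \{\theta \in \Theta_n : d_n(\theta, \theta_0) < \varepsilon\}, d_n) \le n\varepsilon_n^2, \tag{3.2}$$

$$\frac{\Pi_n(\Theta \setminus \Theta_n)}{\Pi_n(B_n^*(\theta_0, \varepsilon_n; k))} = o(e^{-2n\varepsilon_n^2}), \tag{3.3}$$

$$\frac{\Pi_n(\theta \in \Theta_n : j\varepsilon_n < d_n(\theta, \theta_0) \le 2j\varepsilon_n)}{\Pi_n(B_n^*(\theta_0, \varepsilon_n; k))} \le e^{n\varepsilon_n^2 j^2/4}. \tag{3.4}$$

Then $P_{\theta_0}^{(n)} \Pi_n(\theta : d_n(\theta, \theta_0) \ge M_n\varepsilon_n \mid X^{(n)}) \to 0$ for every $M_n \to \infty$.

The average Hellinger distance is not always the most natural choice. It
can be replaced by any other distance $d_n$ that satisfies (3.2)-(3.3) and for
which the conclusion of Lemma 2 holds. Often, we set $k = 2$ and work with
the smaller neighborhood

$$\bar B_n(\theta_0, \varepsilon) = \Bigl\{\theta : \frac1n \sum_{i=1}^n K_i(\theta_0, \theta) \le \varepsilon^2,\ \frac1n \sum_{i=1}^n V_{2;i}(\theta_0, \theta) \le \varepsilon^2\Bigr\}. \tag{3.5}$$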

4. Markov chains. For $\theta$ ranging over a set $\Theta$, let $(x, y) \mapsto p_\theta(y \mid x)$ be
a collection of transition densities from a measurable space $(\mathfrak{X}, \mathcal{A})$ into
itself, relative to some reference measure $\mu$. Thus, for each $\theta \in \Theta$, the map
$(x, y) \mapsto p_\theta(y \mid x)$ is measurable and for each $x$, the map $y \mapsto p_\theta(y \mid x)$ is a
probability density relative to $\mu$. Let $X_0, X_1, \ldots$ be a stationary Markov chain
generated according to the transition density $p_\theta$, where it is assumed that
there exists a stationary distribution $Q_\theta$ with $\mu$-density $q_\theta$. Let $P_\theta^{(n)}$ be the
law of $(X_0, X_1, \ldots, X_n)$.

Tests satisfying the conditions of (2.2) can be obtained from results of
Birgé [4], which are more refined versions of his own results in [3]. A special
case is presented as Lemma 3 below. Actually, Birgé's result ([4], Theorem 3,
page 155) is much more general in that it also applies to nonstationary chains
and allows different upper and lower bounds, as seen in the following display.

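Assume that there exists a finite measure $\nu$ on $(\mathfrak{X}, \mathcal{A})$ such that, for some
$k, l \in \mathbb{N}$, every $\theta \in \Theta$ and every $x \in \mathfrak{X}$ and $A \in \mathcal{A}$,

$$P_\theta(X_l \in A \mid X_0 = x) \lesssim \nu(A) \lesssim \frac1k \sum_{j=1}^k P_\theta(X_j \in A \mid X_0 = x), \tag{4.1}$$

where $P_\theta$ is the generic notation for any probability law governed by $\theta$. For
instance, if there exists a $\mu$-integrable function $r$ such that $r(y) \lesssim p_\theta(y \mid x) \lesssim
r(y)$ for every $(x, y)$, then (4.1) holds with the measure $\nu$ given by $d\nu(y) =
r(y)\,d\mu(y)$. Define the square of a semidistance $d$ by

$$d^2(\theta, \theta') = \iint \Bigl[\sqrt{p_\theta(y \mid x)} - \sqrt{p_{\theta'}(y \mid x)}\Bigr]^2\,d\mu(y)\,d\nu(x). \tag{4.2}$$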

Lemma 3. If there exist $k$, $l$ and a measure $\nu$ such that (4.1) holds, then
there exist a constant $K$ depending only on $(k, l)$ and tests $\phi_n$ such that

$$P_{\theta_0}^{(n)} \phi_n \le e^{-Kn d^2(\theta_0, \theta_1)}, \qquad \sup_{\theta \in \Theta : d(\theta, \theta_1) \le d(\theta_0, \theta_1)/8} P_\theta^{(n)}(1 - \phi_n) \le e^{-Kn d^2(\theta_0, \theta_1)}.$$

The preceding lemma is also true if the chain is not started at stationarity.
If, as we assume, $X_0$ is generated from a stationary distribution under $\theta_0$,
then the Kullback-Leibler divergence of $P_{\theta_0}^{(n)}$ and $P_\theta^{(n)}$ satisfies

$$K(P_{\theta_0}^{(n)}, P_\theta^{(n)}) = n \int K(p_{\theta_0}(\cdot \mid x), p_\theta(\cdot \mid x))\,dQ_{\theta_0}(x) + K(q_{\theta_0}, q_\theta). \tag{4.3}$$

To handle the neighborhoods $B_n(\theta_0, \varepsilon; 2)$, we need a bound on $V_0(P_{\theta_0}^{(n)}, P_\theta^{(n)})$,
which will also be of the order of $n$ times an expression depending only
on individual observations, under a variety of conditions. In the following
lemma, we use an $\alpha$-mixing assumption. For a sequence $\{X_n\}$, let the
$\alpha$-mixing coefficient be given by $\alpha_h = \sup\{|\Pr(X_0 \in A, X_h \in B) - \Pr(X_0 \in
A)\Pr(X_h \in B)| : A, B \in \mathcal{B}(\mathbb{R})\}$.
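The decomposition of the Kullback-Leibler divergence for a stationary chain (a term of order $n$ from the transitions plus a term from the initial distribution) can be checked by brute force on a small example. The sketch below is a sanity check under assumptions, not part of the paper: it uses a two-state chain with counting measure as reference measure, and the transition matrices are hypothetical.

```python
import itertools
import math

def stationary(P):
    """Stationary distribution of a two-state chain with transition matrix P."""
    a, b = P[0][1], P[1][0]
    return [b / (a + b), a / (a + b)]

def path_prob(path, P, q):
    """Probability of a path (x_0, ..., x_n) started from the distribution q."""
    prob = q[path[0]]
    for x, y in zip(path, path[1:]):
        prob *= P[x][y]
    return prob

def kl_by_enumeration(n, P0, P1):
    """K between the laws of (X_0, ..., X_n), summing over all 2^(n+1) paths."""
    q0, q1 = stationary(P0), stationary(P1)
    total = 0.0
    for path in itertools.product((0, 1), repeat=n + 1):
        p0, p1 = path_prob(path, P0, q0), path_prob(path, P1, q1)
        total += p0 * math.log(p0 / p1)
    return total

def kl_by_formula(n, P0, P1):
    """n times the averaged transition divergence plus the divergence of the stationary laws."""
    q0, q1 = stationary(P0), stationary(P1)
    rows = [sum(P0[x][y] * math.log(P0[x][y] / P1[x][y]) for y in (0, 1)) for x in (0, 1)]
    init = sum(q0[x] * math.log(q0[x] / q1[x]) for x in (0, 1))
    return n * sum(q0[x] * rows[x] for x in (0, 1)) + init

P0 = [[0.9, 0.1], [0.3, 0.7]]   # hypothetical "true" transition matrix
P1 = [[0.8, 0.2], [0.4, 0.6]]   # hypothetical alternative
print(kl_by_enumeration(4, P0, P1), kl_by_formula(4, P0, P1))
```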



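Lemma 4. Suppose that the Markov chain $X_0, X_1, \ldots$ is $\alpha$-mixing under
$\theta_0$, with mixing coefficients $\alpha_h$. Then for every $s > 2$, $V_0(P_{\theta_0}^{(n)}, P_\theta^{(n)})$ is bounded
by

$$\frac{8sn}{s-2} \sum_{h=0}^\infty \alpha_h^{1-2/s} \Bigl(\iint \Bigl|\log \frac{p_{\theta_0}(y \mid x)}{p_\theta(y \mid x)}\Bigr|^s p_{\theta_0}(y \mid x)\,d\mu(y)\,dQ_{\theta_0}(x)\Bigr)^{2/s} + 2V(q_{\theta_0}, q_\theta).$$

Proof. We can write

$$\log \frac{p_{\theta_0}^{(n)}}{p_\theta^{(n)}}(X_0, \ldots, X_n) = \sum_{i=1}^n \log \frac{p_{\theta_0}(X_i \mid X_{i-1})}{p_\theta(X_i \mid X_{i-1})} + \log \frac{q_{\theta_0}(X_0)}{q_\theta(X_0)} =: n\bar Y_n + Z, \tag{4.4}$$

where $Y_i = \log(p_{\theta_0}(X_i \mid X_{i-1})/p_\theta(X_i \mid X_{i-1}))$ and $Z = \log(q_{\theta_0}(X_0)/q_\theta(X_0))$.
Then $Y_1, Y_2, \ldots$ are $\alpha$-mixing with mixing coefficients $\alpha_{h-1}$. Therefore, the
variance of $n\bar Y_n$ in (4.4) is bounded above by $n(E|Y_i|^s)^{2/s} \times
4s(s-2)^{-1} \sum_{h=0}^\infty \alpha_h^{1-2/s}$, by the bound of Ibragimov [18]. The assertion follows
because the variance of $n\bar Y_n + Z$ is at most twice the sum of the variances of
$n\bar Y_n$ and $Z$, and the variance of $Z$ is at most $V(q_{\theta_0}, q_\theta)$. □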

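Let $\Theta_1 \subset \Theta$ be the set of parameter values such that $K(q_{\theta_0}, q_\theta)$ and
$V(q_{\theta_0}, q_\theta)$ are bounded by 1. Then from (4.3) and Lemma 4, it follows that
for large $n$ and $\varepsilon^2 \ge 2/n$, the set $B_n(\theta_0, \varepsilon; 2)$ contains the set $B^*(\theta_0, \varepsilon; s)$
defined by

$$\Bigl\{\theta \in \Theta_1 : P_{\theta_0} \log \frac{p_{\theta_0}}{p_\theta}(X_1 \mid X_0) \le \tfrac12 \varepsilon^2,\ P_{\theta_0} \Bigl|\log \frac{p_{\theta_0}}{p_\theta}(X_1 \mid X_0)\Bigr|^s \le C_s \varepsilon^s\Bigr\},$$

where the power $s$ must be chosen sufficiently large to ensure that the
mixing coefficients satisfy $\sum_{h=0}^\infty \alpha_h^{1-2/s} < \infty$ and where $C_s^{-2/s} = 16s(s-2)^{-1}
\sum_{h=0}^\infty \alpha_h^{1-2/s}$. The contributions of $Q_{\theta_0}(\log(q_{\theta_0}/q_\theta))$ and $Q_{\theta_0}(\log(q_{\theta_0}/q_\theta))^2$
may also be incorporated into the bound.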

The above facts may be combined to obtain the following result. 

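Theorem 5. Let $P_\theta^{(n)}$ be the distribution of $(X_0, X_1, \ldots, X_n)$ for a
stationary Markov chain $X_0, X_1, \ldots$ with transition densities $p_\theta(y \mid x)$ and
stationary density $q_\theta$ satisfying (4.1) and let $d$ be defined by (4.2). Assume,
further, that the chain is $\alpha$-mixing with coefficients $\alpha_h$ satisfying $\sum_{h=0}^\infty \alpha_h^{1-2/s} <
\infty$ for some $s > 2$. Suppose that for a sequence $\varepsilon_n \to 0$ such that $n\varepsilon_n^2 \ge 2$,
some $s > 2$, every sufficiently large $j$ and sets $\Theta_n \subset \Theta$, the following
conditions are satisfied:

$$\sup_{\varepsilon > \varepsilon_n} \log N(\varepsilon/16, \{\theta \in \Theta_n : d(\theta, \theta_0) < \varepsilon\}, d) \le n\varepsilon_n^2, \tag{4.5}$$

$$\frac{\Pi_n(\Theta \setminus \Theta_n)}{\Pi_n(B^*(\theta_0, \varepsilon_n; s))} = o(e^{-2n\varepsilon_n^2}), \tag{4.6}$$

$$\frac{\Pi_n(\theta \in \Theta_n : (j-1)\varepsilon_n < d(\theta, \theta_0) \le j\varepsilon_n)}{\Pi_n(B^*(\theta_0, \varepsilon_n; s))} \le e^{Kn\varepsilon_n^2 j^2/8} \tag{4.7}$$

for the constant $K$ of Lemma 3. Then $P_{\theta_0}^{(n)} \Pi_n(\theta : d(\theta, \theta_0) \ge M_n\varepsilon_n \mid X^{(n)}) \to 0$
for every $M_n \to \infty$.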

A Markov chain with $n$-step transition probability $P^n(x, \cdot) = \Pr(X_n \in
\cdot \mid X_0 = x)$ and stationary measure $Q$ is called uniformly ergodic if $\|P^n(x, \cdot) -
Q\| \to 0$ as $n \to \infty$, uniformly in $x$, where $\|\cdot\|$ is the total variation norm. It
can be shown that the convergence is then automatically exponentially fast
(cf. [23], Theorem 16.0.2). Thus, the $\alpha$-mixing coefficients are exponentially
decreasing and hence satisfy $\sum_{h=0}^\infty \alpha_h^{1-2/s} < \infty$ for every $s > 2$. Hence, it
suffices to verify (4.7) with some arbitrary fixed $s > 2$. If $\sup\{\int |p_{\theta_0}(y \mid x_1) -
p_{\theta_0}(y \mid x_2)|\,d\mu(y) : x_1, x_2 \in \mathbb{R}\} < 2$, then integrating out $x_2$ relative to the
stationary measure $q_{\theta_0}$, we see that Condition (16.8) of Theorem 16.0.2 of [23]
holds and hence the chain is uniformly ergodic.

5. White noise model. Let $\Theta \subset L_2[0,1]$ and for $\theta \in \Theta$, let $P_\theta^{(n)}$ be the
distribution on $C[0,1]$ of the stochastic process $X^{(n)} = (X_t^{(n)} : 0 \le t \le 1)$
defined structurally as $X_t^{(n)} = \int_0^t \theta(s)\,ds + \frac{1}{\sqrt n} W_t$ for a standard Brownian
motion $W$. This is the standard white noise model, which is known to arise
as an approximation of many particular sequences of experiments. An
equivalent experiment is obtained by the one-to-one correspondence of $X^{(n)}$ with
the sequence defined by $X_{n,i} = \langle X^{(n)}, e_i \rangle$, where $\langle \cdot, \cdot \rangle$ is the inner product of
$L_2[0,1]$ and $\{e_1, e_2, \ldots\}$ is a given orthonormal basis of $L_2[0,1]$. The variables
$X_{n,1}, X_{n,2}, \ldots$ are independent and normally distributed, with means $\langle \theta, e_i \rangle$
and variance $n^{-1}$. In the following, we use this concrete representation and
abuse notation by identifying $X^{(n)}$ with the sequence $(X_{n,1}, X_{n,2}, \ldots)$ and
$\theta \in \Theta$ with the sequence $(\theta_1, \theta_2, \ldots)$ defined by $\theta_i = \langle \theta, e_i \rangle$. In the latter
representation, we have that $\Theta \subset \ell_2$, the space of square summable sequences.
Let $\|\theta\|^2 = \int_0^1 \theta^2(s)\,ds = \sum_{i=1}^\infty \theta_i^2$ denote the squared $L_2$-norm.

Tests satisfying the conditions of (2.2) can easily be found explicitly,
namely, as the likelihood ratio test for $\theta_0$ versus $\theta_1$, where we can use the
$L_2$-norm for both $d_n$ and $e_n$. Furthermore, the Kullback-Leibler divergence
and discrepancy $V_{2,0}$ also turn out to be multiples of the squared $L_2$-norm.






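Lemma 5. The test $\phi_n = 1\{2\langle \theta_1 - \theta_0, X^{(n)} \rangle > \|\theta_1\|^2 - \|\theta_0\|^2\}$ satisfies
$P_{\theta_0}^{(n)} \phi_n \le 1 - \Phi(\sqrt n \|\theta_1 - \theta_0\|/2)$ and $P_\theta^{(n)}(1 - \phi_n) \le 1 - \Phi(\sqrt n \|\theta_1 - \theta_0\|/4)$
for any $\theta \in \Theta$ such that $\|\theta - \theta_1\| \le \|\theta_1 - \theta_0\|/4$.

Lemma 6. For every $\theta, \theta_0 \in \Theta \subset L_2[0,1]$, we have $K(P_{\theta_0}^{(n)}, P_\theta^{(n)}) = \frac12 n\|\theta -
\theta_0\|^2$ and $V_{2,0}(P_{\theta_0}^{(n)}, P_\theta^{(n)}) = n\|\theta - \theta_0\|^2$. Consequently, we have $B_n(\theta_0, \varepsilon; 2) =
\{\theta \in \Theta : \|\theta - \theta_0\| \le \varepsilon\}$.

Proof of Lemma 5. The test rejects the null hypothesis for positive
values of the statistic $T_n = \langle \theta_1 - \theta_0, X^{(n)} \rangle - \frac12\|\theta_1\|^2 + \frac12\|\theta_0\|^2$, which, under $\theta$,
is distributed as $\langle \theta_1 - \theta_0, \theta - \theta_1 \rangle + \frac12\|\theta_1 - \theta_0\|^2 + \frac{1}{\sqrt n}\langle \theta_1 - \theta_0, W \rangle$. The variable
$\langle \theta_1 - \theta_0, W \rangle$ is normally distributed with mean zero and variance $\|\theta_1 - \theta_0\|^2$.
Under $\theta = \theta_0$, the mean of the test statistic is equal to $-\frac12\|\theta_0 - \theta_1\|^2$, whereas
for $\|\theta - \theta_1\| \le \xi\|\theta_1 - \theta_0\|$ and $\xi \in (0, \frac12)$, the mean of the statistic under
$\theta$ is bounded below by $(\frac12 - \xi)\|\theta_0 - \theta_1\|^2$, in view of the Cauchy-Schwarz
inequality. The lemma follows upon choosing $\xi = 1/4$. □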
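Because the test statistic is Gaussian, the error probabilities of the likelihood ratio test can be computed exactly and compared with the bounds of the lemma. The following sketch does this for finitely many coordinates; the vectors, the sample size and the function names are hypothetical choices made for illustration.

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def error_probabilities(theta0, theta1, theta, n):
    """Exact type I error under theta0 and type II error under theta.

    Under a parameter t, T_n is normal with mean
    <theta1 - theta0, t - theta1> + ||theta1 - theta0||^2 / 2
    and standard deviation ||theta1 - theta0|| / sqrt(n).
    """
    diff = [b - a for a, b in zip(theta0, theta1)]
    sd = math.sqrt(inner(diff, diff) / n)

    def mean_under(t):
        return inner(diff, [x - y for x, y in zip(t, theta1)]) + 0.5 * inner(diff, diff)

    type1 = std_normal_cdf(mean_under(theta0) / sd)    # P(T_n > 0) under theta0
    type2 = std_normal_cdf(-mean_under(theta) / sd)    # P(T_n <= 0) under theta
    return type1, type2

theta0 = [0.0, 0.0, 0.0]
theta1 = [1.0, 0.5, 0.0]
theta = [0.9, 0.6, 0.1]     # within ||theta1 - theta0|| / 4 of theta1
n = 16
dist = math.sqrt(inner(theta1, theta1))                 # ||theta1 - theta0||, since theta0 = 0
type1, type2 = error_probabilities(theta0, theta1, theta, n)
print(type1, 1 - std_normal_cdf(math.sqrt(n) * dist / 2))
print(type2, 1 - std_normal_cdf(math.sqrt(n) * dist / 4))
```

In this example the type I error coincides with the first bound, and the type II error lies below the second bound, as the proof predicts.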

Proof of Lemma 6. We write $\log(p_{\theta_0}^{(n)}/p_\theta^{(n)}) = n\langle \theta_0 - \theta, X^{(n)} \rangle - \frac n2\|\theta_0\|^2 +
\frac n2\|\theta\|^2$, whence the mean and variance under $\theta_0$ are easily obtained. □

In the preceding lemmas, no restriction on the parameter set $\Theta \subset L_2[0,1]$
was imposed. The lemmas lead to the following theorem, which gives bounds
on the rate of convergence in terms of quantities involving the $L_2$-norm only.

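Theorem 6. Let $P_\theta^{(n)}$ be the distribution on $C[0,1]$ of the solution of
the diffusion equation $dX_t = \theta(t)\,dt + n^{-1/2}\,dW_t$ with $X_0 = 0$. Suppose that
for $\varepsilon_n \to 0$, $(n\varepsilon_n^2)^{-1} = O(1)$ and $\Theta \subset L_2[0,1]$, the following conditions are
satisfied:

$$\sup_{\varepsilon > \varepsilon_n} \log N(\varepsilon/8, \{\theta \in \Theta : \|\theta - \theta_0\| < \varepsilon\}, \|\cdot\|) \le n\varepsilon_n^2; \tag{5.1}$$

for every $j \in \mathbb{N}$,

$$\frac{\Pi_n(\theta \in \Theta : \|\theta - \theta_0\| \le j\varepsilon_n)}{\Pi_n(\theta \in \Theta : \|\theta - \theta_0\| \le \varepsilon_n)} \le e^{n\varepsilon_n^2 j^2/64}. \tag{5.2}$$

Then $P_{\theta_0}^{(n)} \Pi_n(\theta \in \Theta : \|\theta - \theta_0\| \ge M_n\varepsilon_n \mid X^{(n)}) \to 0$ for every $M_n \to \infty$.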



In Section 7.6, we shall calculate the rate of convergence for a conjugate
prior.

6. Gaussian time series. Suppose that $X_1, X_2, \ldots$ is a stationary
Gaussian process with mean zero and spectral density $f$, which is known to belong
to a model $\mathcal{F}$. Let $\gamma_h(f) = \int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\,d\lambda$ be the corresponding
autocovariance function. Let $P_f^{(n)}$ be the distribution of $(X_1, \ldots, X_n)$.

For this situation, we can derive the following lemma from [3]. Let $\|f\|_2$
and $\|f\|_\infty$ be the $L_2$-norm relative to Lebesgue measure and the uniform
norm of a function $f : (-\pi, \pi] \to \mathbb{R}$, respectively.

Lemma 7. Suppose that there exist constants $\Gamma$ and $M$ such that $\|\log f\|_\infty \le
\Gamma$ and $\sum_{h=-\infty}^\infty |h|\,\gamma_h^2(f) \le M$ for every $f \in \mathcal{F}$. Then there exist constants $\xi$
and $K$ depending only on $\Gamma$ and $M$ such that for every $\varepsilon \ge 1/\sqrt n$ and every
$f_0, f_1 \in \mathcal{F}$ with $\|f_1 - f_0\|_2 \ge \sqrt 6\,\varepsilon$, we have

$$P_{f_0}^{(n)} \phi_n \vee \sup_{f \in \mathcal{F} : \|f - f_1\|_\infty \le \xi\varepsilon} P_f^{(n)}(1 - \phi_n) \le e^{-Kn\varepsilon^2}. \tag{6.1}$$

Proof. It follows from the assumptions that $\sum_{|h| > n/2} \gamma_h^2(f) \le 2M/n$.
This is bounded by $\varepsilon^2$ for $\varepsilon \ge \sqrt{2M/n}$. The assertion follows from
Proposition 5.5, page 222 of [3], with $\phi_n = 1\{\log(p_{f_1}^{(n)}/p_{f_0}^{(n)}) \ge 0\}$. □

The preceding lemma shows that tests satisfying the conditions of (2.2)
exist when $d_n$ is the $L_2$-distance and when $e_n$ is the uniform distance, leading
to conditions in terms of $N(\varepsilon\xi, \{f \in \mathcal{F} : \|f - f_0\|_2 \le \varepsilon\}, \|\cdot\|_\infty)$. We do not
know if the $L_\infty$-distance can be replaced by the $L_2$-distance. The uniform
bound on $\|\log f\|_\infty$ is not unreasonable as it is known that the structure of
the time series changes dramatically if the spectral density approaches zero.
The following lemma allows the neighborhoods $B_n(f_0, \varepsilon; 2)$ to be dealt with
entirely in terms of balls for the $L_2$-norm.

Lemma 8. Suppose that there exists a constant $\Gamma$ such that $\|\log f\|_\infty \le
\Gamma$ for every $f \in \mathcal{F}$. Then there exists a constant $C$ depending only on $\Gamma$
such that for every $f, g \in \mathcal{F}$, we have $P_f^{(n)}(\log(p_f^{(n)}/p_g^{(n)})) \le Cn\|f - g\|_2^2$ and
$\operatorname{var}_{P_f^{(n)}}(\log(p_f^{(n)}/p_g^{(n)})) \le Cn\|f - g\|_2^2$.

Proof. The $(k, l)$th element of the covariance matrix $T_n(f)$ of $X^{(n)} =
(X_1, \ldots, X_n)$, given the spectral density $f$, is given by $\int_{-\pi}^{\pi} e^{i\lambda(k-l)} f(\lambda)\,d\lambda$ for
$1 \le k, l \le n$. Using the matrix identities $\det(AB^{-1}) = \det(I + B^{-1/2}(A -
B)B^{-1/2})$ and $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$, we can write

$$\log \frac{p_f^{(n)}}{p_g^{(n)}}(X^{(n)}) = -\frac12 \log\det\bigl(I + T_n(g)^{-1/2} T_n(f - g) T_n(g)^{-1/2}\bigr) - \frac12 (X^{(n)})^T T_n(f)^{-1} T_n(g - f) T_n(g)^{-1} X^{(n)}.$$

For a random vector $X$ with mean zero and covariance matrix $\Sigma$, we have
$E(X^T A X) = \operatorname{tr}(\Sigma A)$ and $\operatorname{var}(X^T A X) = \operatorname{tr}(\Sigma A \Sigma A) + \operatorname{tr}(\Sigma A \Sigma A^T)$. Hence,

$$P_f^{(n)}\Bigl(\log \frac{p_f^{(n)}}{p_g^{(n)}}\Bigr) = -\frac12 \log\det\bigl(I + T_n(g)^{-1/2} T_n(f - g) T_n(g)^{-1/2}\bigr) - \frac12 \operatorname{tr}\bigl(T_n(g - f) T_n(g)^{-1}\bigr),$$

$$4 \operatorname{var}_{P_f^{(n)}}\Bigl(\log \frac{p_f^{(n)}}{p_g^{(n)}}\Bigr) = \operatorname{tr}\bigl(T_n(g - f) T_n(g)^{-1} T_n(g - f) T_n(g)^{-1}\bigr) + \operatorname{tr}\bigl(T_n(g - f) T_n(g)^{-1} T_n(f) T_n(g)^{-1} T_n(g - f) T_n(f)^{-1}\bigr).$$

Define matrix norms by $\|A\|^2 = \sum_k \sum_l a_{k,l}^2 = \operatorname{tr}(AA^T)$ and $|A| = \sup\{\|Ax\| :
\|x\| = 1\}$, where $\|x\|$ is the Euclidean norm. Then $\operatorname{tr}(A^2) \le \|A\|^2$ and $\|AB\| \le
|A|\,\|B\|$. Furthermore, as a result of the inequalities $-\frac12\mu^2 \le \log(1 + \mu) -
\mu \le 0$, for all $\mu \ge 0$, we have for any nonnegative definite matrix $A$ that
$-\frac12 \operatorname{tr}(A^2) \le \log\det(I + A) - \operatorname{tr}(A) \le 0$. In view of the identities $x^T T_n(f) x =
\int |\sum_k x_k e^{ik\lambda}|^2 f(\lambda)\,d\lambda$ and $x^T T_n(1) x = 2\pi\|x\|^2$, we also have that $|T_n(f)| \le
2\pi\|f\|_\infty$ and $|T_n(f)^{-1}| \le (2\pi)^{-1}\|1/f\|_\infty$. To see the validity of the second
inequality, we use the fact that $|A^{-1}| \le c^{-1}$ if $\|Ax\| \ge c\|x\|$ for all $x$. For
$f \in \mathcal{F}$, $\|f\|_\infty < \infty$ and $\|1/f\|_\infty < \infty$. Furthermore,

$$\|T_n(f)\|^2 = \sum_{|h| < n} (n - |h|)\,\gamma_h^2(f) \le 2\pi n \int_{-\pi}^{\pi} f^2(\lambda)\,d\lambda. \tag{6.2}$$

Using the preceding inequalities and the identity $\operatorname{tr}(AB) = \operatorname{tr}(BA)$, it is
straightforward to obtain the desired bounds on the mean and variance of
$\log(p_f^{(n)}/p_g^{(n)})$. □
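The norm inequalities for the Toeplitz covariance matrix can be checked numerically on a small example. The sketch below assumes the spectral density $f(\lambda) = 1 + a\cos\lambda$ with $|a| < 1$, for which $\gamma_0 = 2\pi$, $\gamma_{\pm 1} = \pi a$ and all other autocovariances vanish; this density, the vector `x` and the function names are illustrative choices, not taken from the paper.

```python
import math

def gamma(h, a):
    """Autocovariance gamma_h(f) for the assumed density f(lam) = 1 + a*cos(lam)."""
    if h == 0:
        return 2.0 * math.pi
    if abs(h) == 1:
        return math.pi * a
    return 0.0

def toeplitz(n, a):
    """Covariance matrix T_n(f) with (k, l) entry gamma_{k-l}(f)."""
    return [[gamma(k - l, a) for l in range(n)] for k in range(n)]

def quad_form(T, x):
    """Quadratic form x^T T x."""
    m = len(x)
    return sum(x[k] * T[k][l] * x[l] for k in range(m) for l in range(m))

def frobenius_sq(T):
    """Squared norm ||T||^2 = sum of squared entries."""
    return sum(v * v for row in T for v in row)

n, a = 6, 0.5
x = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0]
norm_sq = sum(v * v for v in x)
T = toeplitz(n, a)

# x^T T_n(1) x = 2*pi*||x||^2 and x^T T_n(f) x <= 2*pi*||f||_inf*||x||^2.
print(quad_form(toeplitz(n, 0.0), x), 2 * math.pi * norm_sq)
print(quad_form(T, x), 2 * math.pi * (1 + abs(a)) * norm_sq)
# ||T_n(f)||^2 <= 2*pi*n * int f^2, where int f^2 = 2*pi + pi*a^2 here.
print(frobenius_sq(T), 2 * math.pi * n * (2 * math.pi + math.pi * a * a))
```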

The preceding lemmas can be combined to obtain the following theorem,
where the constants $\xi$ and $K$ are those introduced in Lemma 7.

Theorem 7. Let $P_f^{(n)}$ be the distribution of $(X_1, \ldots, X_n)$ for a
stationary Gaussian time series $\{X_t : t = 0, \pm 1, \ldots\}$ with spectral density $f \in \mathcal{F}$.
Assume that there exist constants $\Gamma$ and $M$ such that $\|\log f\|_\infty \le \Gamma$ and
$\sum_h |h|\,\gamma_h^2(f) \le M$ for every $f \in \mathcal{F}$. Let $\varepsilon_n \ge 1/\sqrt n$ satisfy, for every $j \in \mathbb{N}$,

$$\sup_{\varepsilon > \varepsilon_n} \log N(\xi\varepsilon/2, \{f \in \mathcal{F} : \|f - f_0\|_2 \le \varepsilon\}, \|\cdot\|_\infty) \le n\varepsilon_n^2,$$

$$\frac{\Pi(f : \|f - f_0\|_2 \le j\varepsilon_n)}{\Pi(f : \|f - f_0\|_2 \le \varepsilon_n)} \le e^{Kn\varepsilon_n^2 j^2/8}.$$

Then $P_{f_0}^{(n)} \Pi(f : \|f - f_0\|_2 \ge M_n\varepsilon_n \mid X_1, \ldots, X_n) \to 0$ for every $M_n \to \infty$.

7. Applications. In this section, we present a number of examples of
application of the general results obtained in the preceding sections. The
examples concern combinations of a variety of models with various prior
distributions.

7.1. Finite sieves. Consider the setting of independent, nonidentically 
distributed observations of Section 3. We construct sequences of priors, each 
supported on finitely many points such that the posterior distribution con- 
verges at a rate equal to the solution of an equation involving bracketing 
entropy numbers. Because bracketing entropy numbers are often close to 
metric entropy numbers, this construction exhibits priors for which the prior 
mass condition (2.5) is automatically satisfied. The construction is similar 
to that for the i.i.d. case given by Ghosal, Ghosh and van der Vaart [14], 
Section 3. However, in this case, some extra care is needed to appropri- 
ately define the bracketing numbers in the product space of densities. In the 
following, we consider a componentwise bracketing. 

Consider a sequence of models $\mathcal{P}^{(n)} = \{P_\theta^{(n)} : \theta \in \Theta\}$ of $n$-fold product
measures $P_\theta^{(n)}$, where each measure is given by a density $(x_1, \ldots, x_n) \mapsto
\prod_{i=1}^n p_{\theta,i}(x_i)$ relative to a product-dominating measure $\bigotimes_{i=1}^n \mu_i$. For a given
$n$ and $\varepsilon > 0$, we define the componentwise Hellinger upper bracketing number
for $\Theta$ to be the smallest number $N$ such that there are integrable nonnegative
functions $u_{j,i}$ for $j = 1, 2, \ldots, N$ and $i = 1, 2, \ldots, n$, with the property that
for any $\theta \in \Theta$, there exists some $j$ such that $p_{\theta,i} \le u_{j,i}$ for all $i = 1, 2, \ldots, n$
and $\sum_{i=1}^n h^2(p_{\theta,i}, u_{j,i}) \le n\varepsilon^2$. We shall denote this by $N_{]}^{n\otimes}(\varepsilon, \Theta, d_n)$.

Given a sequence of sets $\Theta_n \uparrow \Theta$ and $\varepsilon_n \to 0$ such that $\log N_{]}^{n\otimes}(\varepsilon_n, \Theta_n, d_n) \le
n\varepsilon_n^2$, let $(u_{j,i} : j = 1, 2, \ldots, N,\ i = 1, 2, \ldots, n)$ be a componentwise Hellinger
upper bracketing for $\Theta_n$ [where $N = N_{]}^{n\otimes}(\varepsilon_n, \Theta_n, d_n)$]. From this bracketing,
we construct a prior distribution $\Pi_n$ on the collection of densities of product
measures, by defining $\Pi_n$ to be the measure that assigns mass $N^{-1}$ to each of
the joint densities $p_j^{(n)} = \bigotimes_{i=1}^n (u_{j,i}/\int u_{j,i}\,d\mu_i)$, $j = 1, 2, \ldots, N$. The collection
$\mathcal{P}_n = \{p_j^{(n)} : j = 1, 2, \ldots, N\}$ forms a sieve for the models $\mathcal{P}^{(n)}$ and can be
considered as the parameter space for a given $n$. Although it is possible for
the spaces $\mathcal{P}_n$ to not be embedded in a fixed space, Theorem 4 still applies
and implies the following result.

Theorem 8. Let $\Theta_n \uparrow \Theta$ and $\theta_0 \in \Theta$. Assume that $\log N_{]}^{n\otimes}(\varepsilon_n, \Theta_n, d_n) \le
n\varepsilon_n^2$ for some sequence $\varepsilon_n \to 0$ with $n\varepsilon_n^2 \to \infty$. Let $\Pi_n$ be the uniform
measure on the renormalized collection of upper product brackets, as indicated
previously. Then for all sufficiently large $M$,

$$P_{\theta_0}^{(n)} \Pi_n\bigl(p^{(n)} : d_n(p_{\theta_0}^{(n)}, p^{(n)}) \ge M\varepsilon_n \mid X_1, X_2, \ldots, X_n\bigr) \to 0. \tag{7.1}$$

Proof. As $\mathcal{P}_n$ consists of finitely many points, its covering number with
respect to any metric is bounded by its cardinality. Thus, (3.2) holds and
(3.3) holds trivially.

Let $\theta_0 \in \Theta_n$ for all $n \ge n_0$. For a given $n \ge n_0$, let $j_0$ be the index for
which $p_{\theta_0,i} \le u_{j_0,i}$ and $\sum_{i=1}^n h^2(p_{\theta_0,i}, u_{j_0,i}) \le n\varepsilon_n^2$. If $p$ is a probability density,
$u$ is an integrable function such that $u \ge p$ and $v = u/\int u$, then because
$2ab \le (a^2 + b^2)$, it easily follows that $h^2(p, v) \le (\int u\,d\mu)^{-1/2} h^2(p, u)$.

For any two probability densities $p$ and $q$, we have (see, e.g., Lemma 8 of
[17])

$$K(p, q) \lesssim h^2(p, q)\Bigl(1 + \log\Bigl\|\frac pq\Bigr\|_\infty\Bigr), \qquad V(p, q) \lesssim h^2(p, q)\Bigl(1 + \log\Bigl\|\frac pq\Bigr\|_\infty\Bigr)^2.$$

Together with the elementary inequalities $1 + \log x \le 2\sqrt x$ and $(1 + \log x)^2 \le
(4x^{1/4})^2 = 16x^{1/2}$ for all $x \ge 1$, the bounds imply that

$$K(p, q) \lesssim h^2(p, q)\Bigl\|\frac pq\Bigr\|_\infty^{1/2}, \qquad V(p, q) \lesssim h^2(p, q)\Bigl\|\frac pq\Bigr\|_\infty^{1/2}.$$

Because $(p_{\theta_0,i}/v_{j_0,i}) \le \int u_{j_0,i}\,d\mu_i$, it follows that $n^{-1} \sum_{i=1}^n K(p_{\theta_0,i}, v_{j_0,i}) \lesssim \varepsilon_n^2$
and $n^{-1} \sum_{i=1}^n V(p_{\theta_0,i}, v_{j_0,i}) \lesssim \varepsilon_n^2$. Thus, $\bigotimes_{i=1}^n v_{j_0,i}$ gets prior probability equal
to $N^{-1} \ge e^{-n\varepsilon_n^2}$ and hence relation (3.4) also holds for a multiple of the
present $\varepsilon_n$. Thus, the posterior converges at the rate $\varepsilon_n$ with respect to the
metric $d_n$. □



7.1.1. Nonparametric Poisson regression. Let $X_1, X_2, \ldots$ be independent
Poisson-distributed random variables with parameters $\psi(z_1), \psi(z_2), \ldots$, where
$\psi : \mathbb{R} \to (0, \infty)$ is an unknown increasing link function and $z_1, z_2, \ldots$ are
one-dimensional covariates. We assume that $L \le \psi \le U$ for some constants
$0 < L < U < \infty$.

If $l \le \psi \le u$, then for any $z$ and $x$, we have $e^{-\psi(z)}(\psi(z))^x/x! \le e^{-l(z)}(u(z))^x/x!$.
For a pair of link functions $l \le u$, let $q_{l,u}(x, z) = e^{-l(z)}(u(z))^x/x!$ and put
$f_{l,u}^{(n)}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n q_{l,u}(x_i, z_i)$. For any constants $L \le \lambda_1, \lambda_2, \mu_1, \mu_2 \le U$, we have

    $\sum_{x=0}^{\infty} \Bigl[\Bigl(\frac{e^{-\lambda_1}\mu_1^x}{x!}\Bigr)^{1/2} - \Bigl(\frac{e^{-\lambda_2}\mu_2^x}{x!}\Bigr)^{1/2}\Bigr]^2 = \bigl(e^{-(\lambda_1 - \mu_1)/2} - e^{-(\lambda_2 - \mu_2)/2}\bigr)^2 + 2e^{-(\lambda_1 + \lambda_2)/2}\bigl(e^{(\mu_1 + \mu_2)/2} - e^{\sqrt{\mu_1\mu_2}}\bigr).$

Let $l_1 \le u_1$ and $l_2 \le u_2$ be two pairs of link functions taking their values
in the interval $[L, U]$. Therefore, with $\mathbb{P}_n^z = n^{-1}\sum_{i=1}^n \delta_{z_i}$ being the empirical
distribution of $z_1, z_2, \ldots, z_n$, we have that $d_n^2(f_{l_1,u_1}^{(n)}, f_{l_2,u_2}^{(n)}) \lesssim \int(|l_1 - l_2|^2 + |u_1 - u_2|^2)\,d\mathbb{P}_n^z$.
Hence, an $\varepsilon$-bracketing of the link functions with respect
to the $L_2(\mathbb{P}_n^z)$-metric yields a componentwise Hellinger upper bracketing
whose size is a multiple of $\varepsilon$. Now the $\varepsilon$-bracketing entropy numbers of the
above class are bounded by a multiple of $\varepsilon^{-1}$ relative to any $L_2$-metric (cf.
Theorem 2.7.5 of [31]). Equating this with $n\varepsilon^2$, we obtain the rate $n^{-1/3}$ for
posterior convergence, which is also the minimax rate, relative to $d_n$.
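
The closed form for the sum over $x$ of squared differences of square roots of the (unnormalized) Poisson-type brackets can be checked numerically. The sketch below truncates the series at a hypothetical cutoff `terms` and compares it with $(e^{-(\lambda_1-\mu_1)/2} - e^{-(\lambda_2-\mu_2)/2})^2 + 2e^{-(\lambda_1+\lambda_2)/2}(e^{(\mu_1+\mu_2)/2} - e^{\sqrt{\mu_1\mu_2}})$; the function names and the parameter values are illustrative assumptions only.

```python
import math

def bracket_mass(lam, mu, x):
    """Value e^{-lam} * mu^x / x! of an upper bracket at the integer x."""
    return math.exp(-lam + x * math.log(mu) - math.lgamma(x + 1))

def series_sum(l1, m1, l2, m2, terms=200):
    """Truncated sum over x of (sqrt(bracket_1) - sqrt(bracket_2))^2."""
    return sum(
        (math.sqrt(bracket_mass(l1, m1, x)) - math.sqrt(bracket_mass(l2, m2, x))) ** 2
        for x in range(terms)
    )

def closed_form(l1, m1, l2, m2):
    """Closed form obtained from sum_x c^x / x! = e^c applied to each term."""
    first = (math.exp(-(l1 - m1) / 2) - math.exp(-(l2 - m2) / 2)) ** 2
    second = 2 * math.exp(-(l1 + l2) / 2) * (
        math.exp((m1 + m2) / 2) - math.exp(math.sqrt(m1 * m2))
    )
    return first + second

print(series_sum(1.0, 1.5, 1.2, 2.0), closed_form(1.0, 1.5, 1.2, 2.0))
```

Since $(\mu_1 + \mu_2)/2 \ge \sqrt{\mu_1\mu_2}$, the second term is nonnegative and vanishes when $\mu_1 = \mu_2$.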

In this example, the normalized upper brackets for the densities are also 
Poisson mass functions corresponding to the link functions equal to the 
upper brackets. Hence, the prior can be viewed as charging the space of 
link functions and the distance d n can also be induced on this space. This 
makes interpretations of the prior and the posterior, as well as the posterior 
convergence rate, more transparent. Further, as the space of link functions 
is a fixed space, proceeding as in Theorem 3.1 of [14], a fixed prior not
depending on $n$ may be constructed such that the posterior converges at the
same $n^{-1/3}$ rate.

7.2. Linear regression with unknown error distribution. Let $X_1, \ldots, X_n$
be independent regression response variables satisfying $X_i = \alpha + \beta z_i + \varepsilon_i$,
$i = 1, 2, \ldots, n$, where the $z_i$'s are nonrandom one-dimensional covariates lying
in $[-L, L]$ for some $L$ and the errors $\varepsilon_i$ are i.i.d. with density $f$ following
some prior $\Pi$. Amewou-Atisso et al. [1] studied posterior consistency under
this setup. Here, we refine the result to a posterior convergence rate. Assume
that [...] $\le C$ for all $x$ and all $f$ in the support of $\Pi$. The priors for $\alpha$
and $\beta$ are assumed to be compactly supported with positive densities in the
interiors of their supports and all the parameters are assumed to be a priori
independent. Let the true value of $(f, \alpha, \beta)$ be $(f_0, \alpha_0, \beta_0)$, an interior point
in the support of the prior.

Let $H(\varepsilon)$ be a bound for the Hellinger $\varepsilon$-entropy of the support of $\Pi$ and
suppose that $f_0(x)/f(x) \le M(x)$ for all $x$, where $\int M^\delta f_0 < \infty$, $\delta > 0$. Then
by Theorem 5 of [33], it follows that $\max\{K(f_0, f), V(f_0, f)\} \lesssim h^2(f_0, f) \log^2(1/h(f_0, f))$.
Let $a(\varepsilon) = -\log \Pi(h(f_0, f) \le \varepsilon)$. The posterior convergence
rate for density estimation is then $\varepsilon_n$, given by

(7.2)  $\max\{H(\varepsilon_n), a(\varepsilon_n/(\log \varepsilon_n^{-1}))\} \le n\varepsilon_n^2.$

The following theorem shows that Euclidean parameters do not affect the 
rate. 

Theorem 9. Under the above setup, if $f_0(x - \alpha_0 - \beta_0 z)/f(x - \alpha - \beta z) \le M(x)$
for all $x, z, \alpha, \beta$, then the joint posterior of $(\alpha, \beta, f)$ concentrates
around $(\alpha_0, \beta_0, f_0)$ at the rate $\varepsilon_n$ defined by (7.2), with respect to $d_n$.

Proof. We have, by $(a + b)^2 \le 2(a^2 + b^2)$, that
$h^2(f_1(\cdot - \alpha_1 - \beta_1 z), f_2(\cdot - \alpha_2 - \beta_2 z)) \le 2h^2(f_1, f_2) + 4C^2|\alpha_1 - \alpha_2|^2 + 4C^2L^2|\beta_1 - \beta_2|^2$, which leads to

    $d_n^2\bigl((f_1, \alpha_1, \beta_1), (f_2, \alpha_2, \beta_2)\bigr) \lesssim h^2(f_1, f_2) + |\alpha_1 - \alpha_2|^2 + |\beta_1 - \beta_2|^2$

and hence the $d_n$-entropy of the parameter space is bounded by a multiple
of $H(\varepsilon) + \log\frac{1}{\varepsilon} \lesssim H(\varepsilon)$.

To lower bound the prior probability of $B_n((f_0, \alpha_0, \beta_0), \varepsilon; 2)$ defined by
(3.5), by Theorem 5 of [33] with $h = h(f_0(\cdot - \alpha_0 - \beta_0 z), f(\cdot - \alpha - \beta z))$, we
have that $K(f_0(\cdot - \alpha_0 - \beta_0 z), f(\cdot - \alpha - \beta z)) \lesssim h^2 \log\frac{1}{h}$ and
$V(f_0(\cdot - \alpha_0 - \beta_0 z), f(\cdot - \alpha - \beta z)) \lesssim h^2 \log^2\frac{1}{h}$. Thus, a multiple of $\varepsilon^2 e^{-c\,a(\varepsilon/\log \varepsilon^{-1})}$ lower
bounds the prior probability of (3.5) and the first factor can be absorbed
into the second, where $c$ is a suitable positive constant. Thus, Theorem 4
implies that the posterior convergence rate with respect to $d_n$ is $\varepsilon_n$. □

More concretely, if the prior is a Dirichlet mixture of normals (or its 
symmetrization) with the scale parameter lying between two positive num- 
bers and the base measure having compact support, and if the true error 
density is also a normal mixture of this type, then by Ghosal and van der 
Vaart [16], it follows that the convergence rate is $(\log n)/\sqrt{n}$. The assumption
of compact support of the base measure can be relaxed by using sieves.
Compactness of the support of the prior for $\alpha$ and $\beta$ may be relaxed by
using sieves $|\alpha| \le c\log n$ if these priors have sub-Gaussian tails. Also, it is
straightforward to extend the result to a multidimensional regressor. For 
more general error densities, one has to allow arbitrarily small scale param- 
eters and apply the results of Ghosal and van der Vaart [17] to obtain a 
slower rate. 

Often, only the Euclidean part is of interest and an $n^{-1/2}$ rate of convergence
is generally obtained in the classical context. The posterior of
the Euclidean part is also expected to converge at an $n^{-1/2}$ rate and the
Bernstein-von Mises theorem may hold; see [26] for some results. However,
as we consider $(f, \alpha, \beta)$ together and obtain global convergence rates, it seems
unlikely that our methods will yield these improved convergence rates for
the Euclidean portion of the parameter.

7.3. Whittle estimation of the spectral density. Let $\{X_t : t \in \mathbb{Z}\}$ be a second
order stationary time series with mean zero and autocovariance function
$\gamma_r = E(X_t X_{t+r})$. The spectral density of the process is defined (under the
assumption that $\sum_r |\gamma_r| < \infty$) by $f(\lambda) = \frac{1}{2\pi}\sum_{r=-\infty}^{\infty} \gamma_r e^{-ir\pi\lambda}$, $\lambda \in [0, 1]$; here,
we have changed the original domain $[-\pi, \pi]$ of the spectral density to $[0, 1]$ by
using symmetry and then rescaling. Let $I_n(\lambda) = (2\pi n)^{-1}\bigl|\sum_{t=1}^n X_t e^{-it\pi\lambda}\bigr|^2$,
$\lambda \in [0, 1]$, denote the periodogram. Because the likelihood is complicated,
Whittle [32] proposed as an approximate likelihood that of a sample $U_1, \ldots, U_\nu$
of independent exponential variables with means $f(2j/n)$, $j = 1, \ldots, \nu$, evaluated
with $U_j = I_n(2j/n)$, where $\nu = \lfloor n/2 \rfloor$. The Whittle likelihood is motivated
by the fact that if $\lambda_{n,i} \to \lambda_i$, $i = 1, \ldots, m$, then under reasonable conditions
such as mixing conditions, $(I_n(\lambda_{n,1}), \ldots, I_n(\lambda_{n,m}))$ converges weakly to
a vector of independent exponential variables with mean vector $(f(\lambda_1), \ldots, f(\lambda_m))$;
see, for instance, Theorem 10.3.2 of Brockwell and Davis [6].
Dahlhaus [10] applied the technique of Whittle likelihood to estimating the
spectral density by the minimum contrast method. A consistent Bayesian
nonparametric method has been proposed by Choudhuri, Ghosal and Roy [7].
Below, we indicate how to obtain a rate of convergence using Theorem 4.
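
To make the Whittle approximation concrete, the following minimal sketch computes the periodogram $I_n(\lambda)$ on the rescaled domain $[0, 1]$ and the log-likelihood of independent exponential variables with means $f(2j/n)$ evaluated at $U_j = I_n(2j/n)$. The function names and the toy series are illustrative assumptions; no particular software implementation is implied.

```python
import cmath
import math

def periodogram(x, lam):
    """I_n(lam) = (2*pi*n)^{-1} * |sum_{t=1}^n x_t * exp(-i*t*pi*lam)|^2 for lam in [0, 1]."""
    n = len(x)
    s = sum(x_t * cmath.exp(-1j * t * math.pi * lam) for t, x_t in enumerate(x, start=1))
    return abs(s) ** 2 / (2 * math.pi * n)

def whittle_loglik(x, f):
    """Log-likelihood of independent exponentials with means f(2j/n), j = 1..floor(n/2),
    evaluated at the periodogram ordinates U_j = I_n(2j/n)."""
    n = len(x)
    nu = n // 2
    total = 0.0
    for j in range(1, nu + 1):
        lam = 2 * j / n
        u_j = periodogram(x, lam)
        # exponential density with mean m: (1/m) * exp(-u/m)
        total += -math.log(f(lam)) - u_j / f(lam)
    return total

# Toy alternating series (hypothetical data), with a constant candidate spectral density.
x = [1.0, -1.0, 1.0, -1.0]
print(periodogram(x, 0.5), periodogram(x, 1.0), whittle_loglik(x, lambda lam: 1.0))
```

For the alternating toy series all the periodogram mass sits at the highest frequency $\lambda = 1$, which is what one expects from a sequence that flips sign at every step.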

As in the proof of consistency, we use the contiguity result of Choudhuri,
Ghosal and Roy [8], which shows that for a Gaussian time series, the sequence
of laws of $(I_n(2/n), \ldots, I_n(2\nu/n))$ and the sequence of approximating
exponential distributions of $(U_1, \ldots, U_\nu)$ are contiguous. Thus, a rate of convergence
of the posterior distribution under the actual distribution follows
from a rate of convergence under the assumption that $U_1, \ldots, U_\nu$ are exactly
independent and exponentially distributed with means $f(2/n), \ldots, f(2\nu/n)$,
to which Theorem 4 can be applied.

Let $\bar d_n^2(f_1, f_2) = \nu^{-1}\sum_{i=1}^{\nu} (f_1(2i/n) - f_2(2i/n))^2$. If $f_1$ and $f_2$ are spectral
densities with $m \le f_1, f_2 \le M$ pointwise, then it follows that

(7.3)  $\frac{1}{4M^2}\bar d_n^2(f_1, f_2) \le d_n^2(f_1, f_2) \le \frac{1}{4m^2}\bar d_n^2(f_1, f_2) \le \frac{1}{4m^2}\|f_1 - f_2\|_\infty^2,$

where $d_n$ is given by (3.1) and $\|\cdot\|_\infty$ is the uniform distance. If the spectral
densities are Lipschitz continuous, then a rate for the discretized $L_2$-distance
$\bar d_n$ will imply a rate for the ordinary $L_2$-distance $\|\cdot\|_2$ by the
relation $\|f_1 - f_2\|_2 \lesssim \bar d_n(f_1, f_2) + (L + M)/n$, where $L$ and $M$ are the Lipschitz
constant and uniform bound, respectively. To see this, note that
$\sqrt{n/\nu}\,\bar d_n(f, 0) = \|f_n\|_2$, where $f_n = \sum_{j=1}^{\nu} f(2j/n) 1_{((2j-2)/n,\,2j/n]}$, and hence

Lip / 2v\ 



n/vd n {fM < 11/ - f n \\ 2 < + 1 I 

n \ n ) 

It follows that for the verification of (3.2), we may always replace $d_n$ by
$\bar d_n$ and if the spectral densities are restricted to Lipschitz functions with
Lipschitz constant $L_n$ and where $\varepsilon_n \gg L_n/n$, then we may also replace $d_n$
by the $L_2$-norm $\|\cdot\|_2$.
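
The comparison (7.3) between the average Hellinger distance and the discretized $L_2$-distance rests on a one-dimensional fact about exponential laws. The sketch below checks it on a grid, assuming that $d_n^2$ in (3.1) averages squared Hellinger distances defined without a factor 1/2; the helper names and the grid are hypothetical.

```python
import math

def hellinger_sq_exponential(a, b):
    """Squared Hellinger distance between exponential laws with means a and b:
    2 - 2 * int sqrt(p_a * p_b) = 2 - 4 * sqrt(a*b) / (a + b)."""
    return 2.0 - 4.0 * math.sqrt(a * b) / (a + b)

def bounds_hold(m, M, steps=20, tol=1e-12):
    """Check (a-b)^2/(4 M^2) <= h^2 <= (a-b)^2/(4 m^2) for all a, b on a grid in [m, M]."""
    grid = [m + (M - m) * i / steps for i in range(steps + 1)]
    for a in grid:
        for b in grid:
            h2 = hellinger_sq_exponential(a, b)
            lower = (a - b) ** 2 / (4 * M * M)
            upper = (a - b) ** 2 / (4 * m * m)
            if not (lower - tol <= h2 <= upper + tol):
                return False
    return True

print(bounds_hold(0.5, 3.0))
```

The bounds follow from writing $h^2 = 2(a - b)^2 / \{(a + b)(\sqrt{a} + \sqrt{b})^2\}$ and bounding the denominator between $8m^2$ and $8M^2$.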

Now, by easy calculations, for all spectral densities $f, f_0$ taking values
in $[m, M]$, we have that $\nu^{-1}\sum_{i=1}^{\nu} K(P_{f_0,i}, P_{f,i}) \lesssim \bar d_n^2(f_0, f) \le \|f - f_0\|_\infty^2$ and
$\nu^{-1}\sum_{i=1}^{\nu} V_{2,0}(P_{f_0,i}, P_{f,i}) \lesssim \bar d_n^2(f_0, f) \le \|f - f_0\|_\infty^2$; hence it suffices to estimate
the prior probability of sets of the form $\{f : \|f - f_0\|_\infty \le \varepsilon\}$. Alternatively,
if the spectral densities under consideration are Lipschitz, then we
may estimate the prior mass of an $L_2$-ball around $f_0$.

As a concrete prior, we consider the prior used by Choudhuri, Ghosal and
Roy [7], namely $f = \tau q$, where $\tau = \mathrm{var}(X_t)$ has a nonsingular prior density
and $q$, a probability density on $[0, 1]$, is given the Dirichlet-Bernstein prior
of Petrone [24]. We then restrict the prior to the set $\mathcal{K} = \{f : m < f < M\}$.
The order of the Bernstein polynomial, $k$, has prior mass function $\rho$, which
is assumed to satisfy $e^{-\beta_1 k \log k} \lesssim \rho(k) \lesssim e^{-\beta_2 k}$. Let $\Pi$ denote the resulting
prior.

Clearly, as $f_0 \in \mathcal{K}$, restricting the prior to $\mathcal{K}$ can only increase the prior
probability of $\{f : \|f - f_0\|_\infty < \varepsilon\}$. Therefore, following Ghosal [12],
$\Pi(\|f - f_0\|_\infty < \varepsilon) \gtrsim e^{-c\varepsilon^{-1}\log \varepsilon^{-1}}$. Hence, $\varepsilon_n$ of the order $n^{-1/3}(\log n)^{1/3}$ satisfies
(3.4).
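
As a worked check of this rate, assuming that (3.4) requires the exponent of the prior mass bound to be at most a multiple of $n\varepsilon_n^2$, one can substitute $\varepsilon_n = n^{-1/3}(\log n)^{1/3}$ directly (valid for $n \ge 3$, so that $\log\log n \ge 0$):

```latex
\varepsilon_n^{-1}\log\varepsilon_n^{-1}
  = n^{1/3}(\log n)^{-1/3}\Bigl(\tfrac13\log n - \tfrac13\log\log n\Bigr)
  \le \tfrac13\, n^{1/3}(\log n)^{2/3}
  = \tfrac13\, n\varepsilon_n^2 ,
\qquad\text{since } n\varepsilon_n^2 = n^{1/3}(\log n)^{2/3}.
```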

Consider a sieve $\mathcal{F}_n$ for the parameter space $\mathcal{K}$, which consists solely of
Bernstein polynomials of order $k_n$ or less. All of these functions have Lipschitz
constant at most $k_n^2$ and are uniformly bounded away from zero and
infinity by construction. The $\varepsilon$-entropy of $\mathcal{F}_n$ relative to $d_n$ can be bounded
above by that of the simplex, which is further bounded above by $k \log k + k \log \varepsilon^{-1}$.
Hence, by choosing $k_n$ of the order $n^{1/3}(\log n)^{2/3}$, the convergence
rate at $f_0$ on $\mathcal{F}_n$ with respect to $d_n$ is given by
$\max(n^{-1/2}k_n^{1/2}(\log n)^{1/2},\ n^{-1/3}(\log n)^{1/3},\ k_n^2/n) = n^{-1/3}(\log n)^{4/3}$. Now,
$\Pi(\mathcal{F}_n^c) = \rho(k > k_n) \lesssim e^{-\beta k_n} = e^{-\beta n^{1/3}(\log n)^{2/3}} = e^{-\beta n(n^{-1/3}(\log n)^{1/3})^2}$,
so the posterior probability of $\mathcal{F}_n^c$
goes to zero by Lemma 1 and hence the convergence rate on $\mathcal{K}$ is also
$n^{-1/3}(\log n)^{1/3}$. The minimax rate $n^{-2/5}$ may be obtained, for instance, by
using splines, which have better approximation properties.

7.4. Nonlinear autoregression. Consider the nonlinear autoregressive model
in which we observe the elements $X_1, \ldots, X_n$ of a stationary time series
$\{X_t : t \in \mathbb{Z}\}$ satisfying

(7.4)  $X_i = f(X_{i-1}) + \varepsilon_i, \qquad i = 1, 2, \ldots, n,$

where $f$ is an unknown function and $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are i.i.d. $N(0, \sigma^2)$. For
simplicity, we assume that $\sigma = 1$. Then $X_n$ is a Markov chain with transition
density $p_f(y|x) = \phi(y - f(x))$, where $\phi(x) = (2\pi)^{-1/2}e^{-x^2/2}$. Assume that
$f \in \mathcal{F}$, a class of functions such that $|f(x)| \le M$ and $|f(x) - f(y)| \le L|x - y|$
for all $x, y$ and $f \in \mathcal{F}$.

Set $r(y) = \frac12(\phi(y - M) + \phi(y + M))$. Then $r(y) \lesssim p_f(y|x) \lesssim r(y)$ for all
$x, y \in \mathbb{R}$ and $f \in \mathcal{F}$. Further, $\sup\{\int |p_f(y|x_1) - p_f(y|x_2)|\,dy : x_1, x_2 \in \mathbb{R}\} < 2$.
Hence, the chain is $\alpha$-mixing with exponentially decaying mixing coefficients
and has a unique stationary distribution $Q_f$ whose density $q_f$ satisfies
$r \lesssim q_f \lesssim r$. Let $\|f\|_s = (\int |f|^s\,dr)^{1/s}$.

Because $h^2(N(\mu_1, 1), N(\mu_2, 1)) = 2[1 - \exp(-|\mu_1 - \mu_2|^2/8)]$, it easily follows
for $f_1, f_2 \in \mathcal{F}$, $d$ defined in (4.2) and $d\nu = r\,d\lambda$ that
$\|f_1 - f_2\|_2 \lesssim d(f_1, f_2) \lesssim \|f_1 - f_2\|_2$. Thus, we may verify (4.5) relative to the $L_2(r)$-metric.
It can also be computed that

    $P_{f_0} \log\frac{p_{f_0}}{p_f}(X_2|X_1) = \frac12\int (f_0 - f)^2 q_{f_0}\,d\lambda \lesssim \|f - f_0\|_2^2,$

    $P_{f_0}\Bigl|\log\frac{p_{f_0}}{p_f}(X_2|X_1)\Bigr|^s \lesssim \int |f_0 - f|^s q_{f_0}\,d\lambda \lesssim \|f - f_0\|_s^s.$

Thus, $B^*(f_0, \varepsilon; s) \supset \{f : \|f - f_0\|_s \le c\varepsilon\}$ for some constant $c > 0$, where
$B^*(f_0, \varepsilon; s)$ is as in Theorem 5. Thus, it suffices to verify (4.7) with $s > 2$.
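
The Hellinger identity for unit-variance normal laws used above, $h^2(N(\mu_1, 1), N(\mu_2, 1)) = 2[1 - \exp(-|\mu_1 - \mu_2|^2/8)]$, can be confirmed by numerical integration. The sketch below uses a plain trapezoidal rule; the integration range and step count are hypothetical choices made only for illustration.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def hellinger_sq_normal_numeric(mu1, mu2, lo=-30.0, hi=30.0, steps=60000):
    """Trapezoidal approximation of int (sqrt(phi(x-mu1)) - sqrt(phi(x-mu2)))^2 dx."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * (math.sqrt(phi(x - mu1)) - math.sqrt(phi(x - mu2))) ** 2
    return total * h

def hellinger_sq_normal_closed(mu1, mu2):
    """Closed form 2 * (1 - exp(-(mu1 - mu2)^2 / 8))."""
    return 2.0 * (1.0 - math.exp(-((mu1 - mu2) ** 2) / 8.0))

print(hellinger_sq_normal_numeric(0.0, 1.0), hellinger_sq_normal_closed(0.0, 1.0))
```

Because $|f_1 - f_2| \le 2M$ on the class considered, $1 - e^{-t^2/8}$ is bounded above and below by multiples of $t^2$ over the relevant range, which is what yields the two-sided comparison with the $L_2$-norm.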



7.4.1. Random histograms. As a prior on the regression functions $f$,
consider a random histogram as follows. For a given number $K \in \mathbb{N}$, partition
a given compact interval in $\mathbb{R}$ into $K$ intervals $I_1, \ldots, I_K$ and let
$I_0 = \mathbb{R} \setminus \bigcup_k I_k$. Let the prior $\Pi_n$ on $f$ be induced by the map $\alpha \mapsto f_\alpha$ given
by $f_\alpha = \sum_{k=1}^K \alpha_k 1_{I_k}$, where the coordinates $\alpha_1, \ldots, \alpha_K$ of $\alpha \in \mathbb{R}^K$ are chosen
to be i.i.d. random variables with the uniform distribution on the interval
$[-M, M]$ and where $K = K_n$ is to be chosen later. Let $r(I_k) = \int_{I_k} r\,d\lambda$.
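
A draw from this random histogram prior is easy to simulate. The sketch below assumes a regular partition of a hypothetical interval $[-A, A)$ into $K$ cells of equal width, with the function set to zero outside the interval; the function name and the seed are illustrative.

```python
import random

def sample_histogram_prior(K, A, M, rng):
    """Draw f_alpha = sum_k alpha_k * 1_{I_k} with alpha_k i.i.d. Uniform[-M, M],
    where I_1, ..., I_K is a regular partition of [-A, A) and f_alpha = 0 elsewhere."""
    alpha = [rng.uniform(-M, M) for _ in range(K)]
    width = 2.0 * A / K

    def f(x):
        if not (-A <= x < A):
            return 0.0                      # x lies in I_0, the complement of the cells
        k = min(int((x + A) / width), K - 1)
        return alpha[k]

    return alpha, f

rng = random.Random(0)
alpha, f = sample_histogram_prior(K=4, A=2.0, M=1.5, rng=rng)
print(alpha, f(-3.0), f(0.0))
```

Every draw is bounded by $M$ but is not Lipschitz, so the prior charges functions outside the class to which $f_0$ is assumed to belong.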

The support of $\Pi_n$ consists of all functions with values in $[-M, M]$ that
are piecewise constant on each interval $I_k$ for $k = 1, \ldots, K$ and which vanish
on $I_0$. For any pair $f_\alpha$ and $f_\beta$ of such functions, we have, for any
$s \in [2, \infty]$, $\|f_\alpha - f_\beta\|_s = \|\alpha - \beta\|_s$, where $\|\alpha\|_s$ is the $r$-weighted $\ell_s$-norm
of $\alpha = (\alpha_1, \ldots, \alpha_K) \in \mathbb{R}^K$ given by $\|\alpha\|_s^s = \sum_k |\alpha_k|^s r(I_k)$. The dual use of
$\|\cdot\|_s$ should not lead to any confusion as it will be clear from the context
whether $\|\cdot\|_s$ is a norm on functions or on vectors. The $L_2(r)$-projection of
$f_0$ onto this support is the function $f_{\alpha_0}$ for $\alpha_{0,k} = \int_{I_k} f_0 r\,d\lambda / r(I_k)$, whence,
by Pythagoras' theorem, $\|f_\alpha - f_0\|_2^2 = \|f_\alpha - f_{\alpha_0}\|_2^2 + \|f_{\alpha_0} - f_0\|_2^2$ for any
$\alpha \in [-M, M]^K$. In particular, $\|f_\alpha - f_0\|_2 \ge c\|\alpha - \alpha_0\|_2$ for some constant $c$
and hence, with $\mathcal{F}_n$ denoting the support of $\Pi_n$,

    $N(\varepsilon, \{f \in \mathcal{F}_n : \|f - f_0\|_2 \le 16\varepsilon\}, \|\cdot\|_2) \le N(\varepsilon, \{\alpha \in \mathbb{R}^K : \|\alpha - \alpha_0\|_2 \le 16c\varepsilon\}, \|\cdot\|_2) \le (80c)^K,$

as in Lemma 4.1 of [25]. Thus, (4.5) holds if $n\varepsilon_n^2 \gtrsim K$.
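To verify (4.7), note that for $\lambda = (\lambda(I_1), \ldots, \lambda(I_K))$,

    $\|f_{\alpha_0} - f_0\|_s^s = \int_{I_0} |f_0|^s r\,d\lambda + \sum_k \int_{I_k} |\alpha_{0,k} - f_0|^s r\,d\lambda \le M^s r(I_0) + L^s\|\lambda\|_s^s.$

Hence, as $f_0 \in \mathcal{F}$, for every $\alpha \in [-M, M]^K$,

    $\|f_\alpha - f_0\|_s \lesssim \|\alpha - \alpha_0\|_s + r(I_0)^{1/s} + \|\lambda\|_s \le \|\alpha - \alpha_0\|_\infty + r(I_0)^{1/s} + \|\lambda\|_s,$

where $\|\cdot\|_\infty$ is the ordinary maximum norm on $\mathbb{R}^K$. For $r(I_0)^{1/s} + \|\lambda\|_s \le \varepsilon/2$,
we have that $\{f : \|f - f_0\|_s \le \varepsilon\} \supset \{f_\alpha : \|\alpha - \alpha_0\|_\infty \le \varepsilon/2\}$. Using
$\|\alpha - \alpha_0\|_2 \le c\|f_\alpha - f_0\|_2$, for any $\varepsilon > 0$ such that $r(I_0)^{1/s} + \|\lambda\|_s \le \varepsilon/2$, we have

    $\frac{\Pi_n(f : \|f - f_0\|_2 \le j\varepsilon)}{\Pi_n(f : \|f - f_0\|_s \le \varepsilon)} \le \frac{\Pi_n(\alpha : \|\alpha - \alpha_0\|_2 \le cj\varepsilon)}{\Pi_n(\alpha : \|\alpha - \alpha_0\|_\infty \le \varepsilon c/2)}.$

We show that the right-hand side is bounded by $e^{Cn\varepsilon^2}$ for some $C$.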

For $\bigcup_{k=1}^K I_k$ a regular partition of an interval $[-A, A]$, we have that $\|\lambda\|_s = 2A/K$
and since $r(I_k) \ge \lambda(I_k) \inf_{x \in I_k} r(x)$ for every $k \ge 1$, the norm $\|\cdot\|_2$
is bounded below by $\sqrt{2A\phi(A)/K} \ge \sqrt{\phi(A)/K}$ times a multiple of the Euclidean
norm. In this case, the preceding display is bounded above by

    $\frac{\bigl(Cj\varepsilon\sqrt{K/\phi(A)}/(2M)\bigr)^K \mathrm{vol}_K}{(\varepsilon c/(4M))^K} \lesssim \Bigl(\frac{Cj\sqrt{2\pi e}}{\sqrt{\phi(A)}}\Bigr)^K \frac{1}{\sqrt{\pi K}},$

by Stirling's approximation, where $\mathrm{vol}_K$ is the volume of the $K$-dimensional
Euclidean unit ball. The probability $r(I_0)$ is bounded above by $1 - 2\Phi(A) \lesssim \phi(A)$.
Hence, (4.7) will hold if $K \log(1/\phi(A)) \lesssim n\varepsilon_n^2$, $\phi(A) \lesssim \varepsilon_n^s$ and $A/K \lesssim \varepsilon_n$.
All requirements are met for $\varepsilon_n$ equal to a multiple of $n^{-1/3}(\log n)^{1/2}$
[with $K \sim \sqrt{\log(1/\varepsilon_n)}\,\varepsilon_n^{-1}$ and $A \sim \sqrt{\log(1/\varepsilon_n)}$]. This is only marginally
weaker than the minimax rate, which is $n^{-1/3}$ for this problem, provided
the autoregression functions are assumed to be only Lipschitz continuous.

The logarithmic factor in the convergence rate appears to be a conse- 
quence of the fact that the regression functions are defined on the full real 
line. The present prior is a special case of a spline-based prior (see, e.g., 
Section 7.7). If $f$ has smoothness beyond Lipschitz continuity, then the use
of higher order splines should yield a faster convergence rate. 



7.5. Finite-dimensional i.n.i.d. models. Theorem 4 is also applicable to
finite-dimensional models and yields the usual convergence rate as shown 
below. The result may be compared with Theorem 1.10.2 of [19] and Propo- 
sition 1 of [13]. 

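Theorem 10. Let $X_1, \ldots, X_n$ be i.n.i.d. observations following densities
$p_{\theta,i}$, where $\Theta \subset \mathbb{R}^d$. Let $\theta_0$ be an interior point of $\Theta$. Assume that there exist
constants $\alpha > 0$ and $0 \le c_i \le C_i < \infty$ with, for every $\theta, \theta_1, \theta_2 \in \Theta$,

(7.5)  $c = \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^n c_i > 0, \qquad C = \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n C_i < \infty$

such that $P_{\theta_0,i}\log\frac{p_{\theta_0,i}}{p_{\theta,i}} \le C_i\|\theta - \theta_0\|^{2\alpha}$, $P_{\theta_0,i}\bigl(\log\frac{p_{\theta_0,i}}{p_{\theta,i}}\bigr)^2 \le C_i\|\theta - \theta_0\|^{2\alpha}$ and

(7.6)  $c_i\|\theta_1 - \theta_2\|^{2\alpha} \le h^2(p_{\theta_1,i}, p_{\theta_2,i}) \le C_i\|\theta_1 - \theta_2\|^{2\alpha}.$

Assume that the prior measure $\Pi$ possesses a density $\pi$ which is bounded
away from zero in a neighborhood of $\theta_0$ and bounded above on the entire parameter
space. Then the posterior converges at the rate $n^{-1/(2\alpha)}$ with respect
to the Euclidean metric.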

For regular families, the above displays are satisfied for $\alpha = 1$ and the
usual $n^{-1/2}$ rate is obtained; see [19], Chapter III. Nonregular cases, for
instance, when the densities have discontinuities depending on the parameter
[such as the uniform distribution on $(0, \theta)$], have $\alpha < 1$ and faster rates are
obtained; see [19], Chapters V and VI and [13].
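
For the uniform distribution on $(0, \theta)$, the Hellinger part of the condition, (7.6), can be seen explicitly: a direct computation gives $h^2 = 2(1 - \sqrt{\theta_1/\theta_2})$ for $\theta_1 \le \theta_2$, which is of the order $|\theta_1 - \theta_2|$ on a compact parameter set, that is, $2\alpha = 1$. The sketch below checks only this Hellinger comparison on the hypothetical parameter range $[1, 2]$; it does not address the Kullback-Leibler conditions of the theorem.

```python
import math

def hellinger_sq_uniform(t1, t2):
    """Squared Hellinger distance between Uniform(0, t1) and Uniform(0, t2):
    2 * (1 - sqrt(min/max)), with no factor 1/2 in the definition."""
    lo, hi = min(t1, t2), max(t1, t2)
    return 2.0 * (1.0 - math.sqrt(lo / hi))

def ratio_to_euclidean(t1, t2):
    """h^2 divided by |t1 - t2|, i.e. the comparison in (7.6) with 2*alpha = 1."""
    return hellinger_sq_uniform(t1, t2) / abs(t1 - t2)

grid = [1.0 + 0.1 * i for i in range(11)]          # hypothetical parameter set [1, 2]
ratios = [ratio_to_euclidean(a, b) for a in grid for b in grid if a != b]
print(min(ratios), max(ratios))
```

On this range the ratio equals $2/\{\sqrt{\theta_2}(\sqrt{\theta_1} + \sqrt{\theta_2})\}$ and therefore stays between $1/2$ and $1$.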

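Proof of Theorem 10. By the assumptions (7.5) and (7.6), it suffices
to show that the posterior convergence rate with respect to $d_n$ defined by
(3.1) is $n^{-1/2}$. Now, by Pollard ([25], Lemma 4.1),

(7.7)  $N(\varepsilon/18, \{\theta \in \Theta : d_n(\theta, \theta_0) < \varepsilon\}, d_n) \le N\bigl((\varepsilon^2/(36C))^{1/(2\alpha)}, \{\theta \in \Theta : \|\theta - \theta_0\| < (2\varepsilon^2/c)^{1/(2\alpha)}\}, \|\cdot\|\bigr),$

which verifies (3.2). For (3.4), note that

    $\frac{\Pi(\theta : d_n(p_\theta, p_{\theta_0}) \le j\varepsilon)}{\Pi(\theta : n^{-1}\sum_{i=1}^n K_i(\theta_0, \theta) \le \varepsilon^2,\ n^{-1}\sum_{i=1}^n V_{2;i}(\theta_0, \theta) \le \varepsilon^2)} \le \frac{\Pi(\theta : \|\theta - \theta_0\| \le (2j^2\varepsilon^2/c)^{1/(2\alpha)})}{\Pi(\theta : \|\theta - \theta_0\| \le (\varepsilon^2/(2C))^{1/(2\alpha)})} \le Aj^{d/\alpha}$

for sufficiently small $\varepsilon > 0$, where $A$ is a constant depending on $d$, $c$, $C$ and
the upper and lower bounds on the prior density. The conclusion follows for
$\varepsilon_n = M/\sqrt{n}$, where $M$ is a large constant. □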

The condition that the Hellinger distance is bounded below by a power 
of the Euclidean distance excludes the possibility of unbounded parameter 
spaces. This defect may be rectified by applying Theorem 3 to derive the 
rate. If there is a uniformly exponentially consistent test for $\theta = \theta_0$ against
the complement of a bounded set, then the result holds even if $\Theta$ is not
bounded. Often, such tests exist by virtue of bounds on log affinity, as in
the case of normal distributions, or by large deviation type inequalities; see
[20] and [14], Section 7. Further, if the prior density is not bounded above,
but has a polynomial or subexponential majorant, then the rate calculation
also remains valid.

7.6. White noise with conjugate priors. In this section, we consider the
white noise model of Section 5 with a conjugate Gaussian prior. This allows
us to complement and rederive results of Zhao [34] and Shen and Wasserman [27]
in our framework. Thus, we observe an infinite sequence $X_1, X_2, \ldots$
of independent random variables, where $X_i$ is normally distributed with
mean $\theta_i$ and variance $n^{-1}$.

We consider the prior $\Pi_n$ on the parameter $\theta = (\theta_1, \theta_2, \ldots)$ that can be
structurally described by saying that $\theta_1, \ldots, \theta_k$ are independent with $\theta_i$ normally
distributed with mean zero and variance $\sigma_{i,k}^2$ and that $\theta_{k+1}, \theta_{k+2}, \ldots$
are set equal to zero. Here, we choose the cutoff $k$ dependent on $n$ and equal
to $k = \lfloor n^{1/(2\alpha+1)} \rfloor$ for some $\alpha > 0$. Zhao [34] and Shen and Wasserman [27]
consider the case where $\sigma_{i,k}^2 = i^{-(2\alpha+1)}$ for $i = 1, \ldots, k$ and show that the
convergence rate is $\varepsilon_n = n^{-\alpha/(2\alpha+1)}$ if the true parameter $\theta_0$ is "$\alpha$-regular"
in the sense that $\sum_{i=1}^{\infty} \theta_{0,i}^2 i^{2\alpha} < \infty$. We shall obtain the same result for any
triangular array of variances such that

(7.8)  $\min\{\sigma_{i,k}^2 i^{2\alpha} : 1 \le i \le k\} \sim k^{-1}.$

For instance, for each $k$, the coefficients $\theta_1, \ldots, \theta_k$ could be chosen i.i.d.
normal with mean zero and variance $k^{-1}$ or could follow the model of the
authors mentioned previously.
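
Because the prior is conjugate, the posterior is available in closed form coordinatewise: with $X_i \sim N(\theta_i, n^{-1})$ and $\theta_i \sim N(0, \sigma_{i,k}^2)$, the posterior of $\theta_i$ is normal with mean $X_i\sigma_{i,k}^2/(\sigma_{i,k}^2 + n^{-1})$ and variance $\sigma_{i,k}^2 n^{-1}/(\sigma_{i,k}^2 + n^{-1})$ (the standard normal-normal update), and $\theta_i = 0$ for $i > k$. The sketch below is illustrative only; the function names are hypothetical.

```python
def zhao_variances(k, alpha):
    """Variances sigma_{i,k}^2 = i^{-(2*alpha+1)} for i = 1..k, the choice the text
    attributes to [34] and [27]."""
    return [i ** -(2 * alpha + 1) for i in range(1, k + 1)]

def posterior_params(x, variances, n):
    """Posterior (mean, variance) of theta_i for i = 1..k under the normal-normal update.
    Coordinates beyond k are fixed at zero by the prior and are not returned."""
    params = []
    for x_i, s2 in zip(x, variances):
        shrink = s2 / (s2 + 1.0 / n)       # weight given to the observation
        params.append((shrink * x_i, shrink / n))
    return params

variances = zhao_variances(5, 1)
# Quantity appearing in (7.8): min over i of sigma_{i,k}^2 * i^{2*alpha}
print(min(s2 * i ** 2 for i, s2 in enumerate(variances, start=1)))
print(posterior_params([2.0], [0.01], 100))
```

For the variances $i^{-(2\alpha+1)}$ the minimum in (7.8) is attained at $i = k$ and equals $k^{-1}$, in line with the displayed condition.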

Theorem 11. If $k \sim n^{1/(2\alpha+1)}$ and (7.8) holds, then the posterior converges
at the rate $\varepsilon_n = n^{-\alpha/(2\alpha+1)}$ for any $\theta_0$ such that $\sum_{i=1}^{\infty} \theta_{0,i}^2 i^{2\alpha} < \infty$.

Proof. The support $\Theta_n$ of the prior is the set of all $\theta \in \ell_2$ with $\theta_i = 0$
for $i > k$ and can be identified with $\mathbb{R}^k$. Moreover, the $\ell_2$-norm $\|\cdot\|$ on the
support can be identified with the Euclidean norm $\|\cdot\|_k$ on $\mathbb{R}^k$. Let $B_k(x, \varepsilon)$
denote the $k$-dimensional Euclidean ball of radius $\varepsilon$ and center $x \in \mathbb{R}^k$. For
any true parameter $\theta_0 \in \ell_2$, we have $\|\theta - \theta_0\| \ge \|P\theta - P\theta_0\|_k$, where $P$ is the
projection on $\Theta_n$, and hence

    $N(\varepsilon/8, \{\theta \in \Theta_n : \|\theta - \theta_0\| \le \varepsilon\}, \|\cdot\|) \le N(\varepsilon/8, B_k(P\theta_0, \varepsilon), \|\cdot\|_k) \le (40)^k.$

It follows that (5.1) is satisfied for $n\varepsilon_n^2 \gtrsim k$, that is, in view of our choice of
$k$, $\varepsilon_n \gtrsim n^{-\alpha/(2\alpha+1)}$.

By Pythagoras' theorem, we have that $\|\theta - \theta_0\|^2 = \|P\theta - P\theta_0\|^2 + \sum_{i>k} \theta_{0,i}^2$
for any $\theta$ in the support of $\Pi_n$. Hence, for $\sum_{i>k} \theta_{0,i}^2 \le \varepsilon_n^2/2$, we have that

    $\Pi_n(\theta \in \Theta_n : \|\theta - \theta_0\| \le \varepsilon_n) \ge \Pi_n(\theta \in \mathbb{R}^k : \|\theta - P\theta_0\|_k \le \varepsilon_n/2).$

By the definition of the prior, the right-hand side involves a quadratic form
in Gaussian variables. For $\Sigma$ the $k \times k$ diagonal matrix with elements $\sigma_{i,k}^2$,
the quotient on the left-hand side of (5.2) can be bounded as

    $\frac{\Pi_n(\theta \in \Theta_n : \|\theta - \theta_0\| \le j\varepsilon_n)}{\Pi_n(\theta \in \Theta_n : \|\theta - \theta_0\| \le \varepsilon_n)} \le \frac{N_k(-P\theta_0, \Sigma)(B(0, j\varepsilon_n))}{N_k(-P\theta_0, \Sigma)(B(0, \varepsilon_n/2))}.$

The probability in the numerator increases if we center the normal distribution
at 0 rather than at $-P\theta_0$, by Anderson's lemma. Furthermore, for any
$\mu \in \mathbb{R}^k$,

    $\frac{dN_k(\mu, \Sigma)}{dN_k(0, \Sigma/2)}(\theta) \ge 2^{-k/2} e^{-\sum_{i=1}^k \mu_i^2/\sigma_{i,k}^2}.$

Therefore, we may recenter the denominator at 0 at the cost of adding the
factor on the right (with $\mu = \theta_0$) and dividing the covariance matrix by 2.
We obtain that the left-hand side of (5.2) is bounded above by

iV fc (0,E/2)(S(0,e n /2)) 

< a*/^^,/^ f g *V jy fc (o,agJ)(g(o,j e n)) 

\<lJ N k (0,*lI/2)(B(0,e n /2)y 

where $\bar\sigma_k$ and $\underline\sigma_k$ denote the maximum and the minimum of $\sigma_{i,k}$ for $i = 1, 2, \ldots, k$.
The probabilities on the right-hand side are left tail probabilities
of chi-square distributions with $k$ degrees of freedom, and can be expressed
as integrals. The preceding display is bounded above by

k ff^/-l x k/2-l e -x/2 dx 



a k j j^/(^D x k/2-i e -x/2 dx 

The exponential in the integral in the numerator is bounded above by 1 
and hence this integral is bounded above by j k e k l /(ka k ). We now consider 
two separate cases. If £%/g\ remains bounded, then we can also bound 
the exponential in the integral in the denominator below by a constant 
and have that the preceding display is bounded above by a multiple of 
4 k j k exp(X^ =1 Ooi/of k ). If e\ /a 2 k — >■ oo, then we bound the integral in the de- 
nominator below by (r//2) fc//2_1 J^ 2 e~ x > 2 dx for 77 = e 2 /(2a 2 k ). This leads to 
the upper bound being a multiple of 8 k j k exp(J2i=i &o i^Jk, ) e nQL~k 2 ex P( e n£Lfc 2 / '8) • 
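By the assumption (7.8), we have that $\underline\sigma_k^2 \gtrsim k^{-(2\alpha+1)} \sim n^{-1}$. We also have
that $k \sim n\varepsilon_n^2$. It follows that $\varepsilon_n^2/\underline\sigma_k^2 \lesssim n\varepsilon_n^2$ and that $\underline\sigma_k^{-2}$ is bounded by a
polynomial in $k$. We conclude that with our choice of $k \sim n^{1/(2\alpha+1)}$, (5.2) is
satisfied if $\varepsilon_n$ satisfies $\sum_{i=1}^k \theta_{0,i}^2/\sigma_{i,k}^2 \lesssim n\varepsilon_n^2$ and $\sum_{i>k} \theta_{0,i}^2 \le \varepsilon_n^2/2$.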

It follows that the posterior concentrates at $\theta_0$ at the rate $\varepsilon_n$ that satisfies
these requirements as well as the condition $\varepsilon_n \gtrsim n^{-\alpha/(2\alpha+1)}$. If the true parameter
$\theta_0$ satisfies $\sum_{i=1}^{\infty} \theta_{0,i}^2 i^{2\alpha} < \infty$, then all three inequalities are satisfied
for $\varepsilon_n$ a multiple of $n^{-\alpha/(2\alpha+1)}$. The rate $n^{-\alpha/(2\alpha+1)}$ is the minimax rate for
this problem. □

Our prior is dependent on $n$, but with some more effort, it can be seen
that the same conclusion can be obtained with a mixture prior of the form
$\sum_n \lambda_n \Pi_n$ for suitable $\lambda_n$.
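
The change-of-measure bound used in the proof of Theorem 11 to recenter the Gaussian probability in the denominator can be verified coordinatewise. The following worked computation assumes only that $\Sigma$ is diagonal with entries $\sigma_{i,k}^2$:

```latex
\frac{dN_k(\mu,\Sigma)}{dN_k(0,\Sigma/2)}(\theta)
  = 2^{-k/2}\exp\Bigl(\sum_{i=1}^k \frac{\theta_i^2 - \tfrac12(\theta_i-\mu_i)^2}{\sigma_{i,k}^2}\Bigr)
  = 2^{-k/2}\exp\Bigl(\sum_{i=1}^k \frac{\tfrac12(\theta_i+\mu_i)^2 - \mu_i^2}{\sigma_{i,k}^2}\Bigr)
  \ge 2^{-k/2}\exp\Bigl(-\sum_{i=1}^k \frac{\mu_i^2}{\sigma_{i,k}^2}\Bigr).
```

The factor $2^{-k/2}$ is the ratio of the normalizing constants, $|\Sigma/2|^{1/2}/|\Sigma|^{1/2}$, and the inequality simply drops the nonnegative term $\tfrac12(\theta_i + \mu_i)^2/\sigma_{i,k}^2$.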

7.7. Nonparametric regression with Gaussian errors. Consider the nonparametric
regression model, where we observe independent random variables
$X_1, \ldots, X_n$ distributed as $X_i = f(z_i) + \varepsilon_i$ for an unknown regression
function $f$, deterministic real-valued covariates $z_1, \ldots, z_n$ and normally distributed
error variables $\varepsilon_1, \ldots, \varepsilon_n$ with zero means and variances $\sigma^2$. For
simplicity, we assume that the error variance $\sigma^2$ is known. We also suppose
that the covariates take values in a fixed compact set, which we will take as
the unit interval, without loss of generality.

Let $f_0$ denote the true value of the regression function, let $P_{f,i}$ be the
distribution of $X_i$ and let $P_f^{(n)}$ be the distribution of $(X_1, \ldots, X_n)$. Thus,
$P_{f,i}$ is the normal measure with mean $f(z_i)$ and variance $\sigma^2$. Let
$\mathbb{P}_n^z = n^{-1}\sum_{i=1}^n \delta_{z_i}$ be the empirical measure of the covariates and let $\|\cdot\|_n$ denote
the norm on $L_2(\mathbb{P}_n^z)$.

By easy calculations, $K(P_{f_0,i}, P_{f,i}) = |f_0(z_i) - f(z_i)|^2/(2\sigma^2)$ and
$V_{2,0}(P_{f_0,i}, P_{f,i}) = |f_0(z_i) - f(z_i)|^2/\sigma^2$ for all $i = 1, 2, \ldots, n$, whence the average
Kullback-Leibler divergence and variance are bounded by a multiple of $\|f_0 - f\|_n^2/\sigma^2$
and hence it is enough to quantify prior concentration in $\|\cdot\|_n$-balls. The
average Hellinger distance, as used in Theorem 4, is bounded above by $\|\cdot\|_n$,
but is equivalent to this norm only if the class of regression functions is uniformly
bounded, which makes it less attractive. However, it can be verified
(cf. [5]) that the likelihood ratio test for $f_0$ versus $f_1$ satisfies the conclusion
of Lemma 2 relative to $\|\cdot\|_n$ (instead of $d_n$ and $\theta_1 = f_1$). Therefore, we may
use the norm $\|\cdot\|_n$ instead of the average Hellinger distance throughout.
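
The two "easy calculations" can be written out as follows, assuming that $V_{2,0}$ denotes the variance of the log-likelihood ratio under $P_{f_0,i}$. Write $X_i = f_0(z_i) + \sigma Z$ with $Z$ standard normal and $\Delta = f_0(z_i) - f(z_i)$:

```latex
\log\frac{p_{f_0,i}}{p_{f,i}}(X_i)
  = \frac{(X_i - f(z_i))^2 - (X_i - f_0(z_i))^2}{2\sigma^2}
  = \frac{\Delta^2 + 2\sigma\Delta Z}{2\sigma^2},
\qquad
\mathrm{E}\Bigl[\log\frac{p_{f_0,i}}{p_{f,i}}\Bigr] = \frac{\Delta^2}{2\sigma^2},
\qquad
\mathrm{var}\Bigl[\log\frac{p_{f_0,i}}{p_{f,i}}\Bigr] = \frac{\Delta^2}{\sigma^2}.
```

Averaging over $i$ gives $\|f_0 - f\|_n^2/(2\sigma^2)$ and $\|f_0 - f\|_n^2/\sigma^2$, respectively.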

We shall construct priors based on series representations that are appropriate
if $f_0 \in C^\alpha[0, 1]$, where $\alpha > 0$ could be fractional. This means that $f_0$ is
$\alpha_0$ times continuously differentiable with $\|f_0\|_\alpha < \infty$, $\alpha_0$ being the greatest
integer less than $\alpha$ and the seminorm being defined by

(7.9)  $\|f\|_\alpha = \sup_{x \ne x'} \frac{|f^{(\alpha_0)}(x) - f^{(\alpha_0)}(x')|}{|x - x'|^{\alpha - \alpha_0}}.$

7.7.1. Splines. Fix an integer $q$ with $q \ge \alpha$. For a given natural number
$K$, which will increase with $n$, partition the interval $(0, 1]$ into $K$ subintervals
$((k - 1)/K, k/K]$ for $k = 1, 2, \ldots, K$. The space of splines of order $q$ relative
to this partition is the collection of all functions $f : (0, 1] \to \mathbb{R}$ that are $q - 2$
times continuously differentiable throughout $(0, 1]$ and, if restricted to a
subinterval $((k - 1)/K, k/K]$, are polynomials of degree strictly less than
$q$. These splines form a $J = (q + K - 1)$-dimensional linear space, with a
convenient basis $B_1, B_2, \ldots, B_J$ being the B-splines, as defined in, for example,
[11]. The B-splines satisfy (i) $B_j \ge 0$, $j = 1, 2, \ldots, J$, (ii) $\sum_{j=1}^J B_j = 1$,
(iii) $B_j$ is supported inside an interval of length $q/K$ and (iv) at most $q$ of
$B_1(x), \ldots, B_J(x)$ are nonzero at any given $x$. Let $B(z) = (B_1(z), \ldots, B_J(z))^T$
and write $\beta^T B$ for the function $z \mapsto \sum_j \beta_j B_j(z)$.
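
Properties (i), (ii) and (iv) of the B-spline basis can be checked numerically. The sketch below evaluates the $J = q + K - 1$ basis functions by the Cox-de Boor recursion on equally spaced knots $k/K$, with $q$-fold boundary knots and left-open, right-closed knot intervals to match the partition of $(0, 1]$ used here; this is one standard construction and is an assumption about, not a quotation of, the definition in [11].

```python
def bspline_basis(x, q, K):
    """Values B_1(x), ..., B_J(x), J = q + K - 1, of the order-q B-splines on (0, 1]
    with interior knots k/K, computed with the Cox-de Boor recursion."""
    knots = [0.0] * q + [k / K for k in range(1, K)] + [1.0] * q
    # Order 1: indicators of the knot intervals (t_j, t_{j+1}]; degenerate intervals give 0.
    vals = [1.0 if knots[j] < x <= knots[j + 1] else 0.0 for j in range(len(knots) - 1)]
    for m in range(2, q + 1):
        new_vals = []
        for j in range(len(knots) - m):
            left_den = knots[j + m - 1] - knots[j]
            right_den = knots[j + m] - knots[j + 1]
            # Terms with a zero denominator are dropped (the usual 0/0 := 0 convention).
            left = (x - knots[j]) / left_den * vals[j] if left_den > 0 else 0.0
            right = (knots[j + m] - x) / right_den * vals[j + 1] if right_den > 0 else 0.0
            new_vals.append(left + right)
        vals = new_vals
    return vals

values = bspline_basis(0.3, q=3, K=4)
print(len(values), sum(values), sum(1 for v in values if v > 0))
```

The local support seen here (at most $q$ nonzero values at any point) is the "spatial separation" that later makes the matrix of inner products of the basis functions well conditioned after scaling by $J$.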

The basic approximation property of splines proved in [11], page 170,
shows that for some $\beta_\infty \in \mathbb{R}^J$ (dependent on $J$),

(7.10)  $\|\beta_\infty^T B - f_0\|_\infty \lesssim J^{-\alpha}\|f_0\|_\alpha.$

Thus, by increasing J appropriately with the sample size, we may view the 
space of splines as a sieve for the construction of the maximum likelihood 
estimator, as in Stone [28, 29], and for Bayes estimates as in [14, 15] for the 
problem of density estimation. 

To put a prior on $f$, we represent it as $f_\beta(z) = \beta^T B(z)$ and induce a
prior on $f$ from a prior on $\beta$. Ghosal, Ghosh and van der Vaart [14], in the
context of density estimation, choose $\beta_1, \ldots, \beta_J$ i.i.d. uniform on an interval
$[-M, M]$, the restriction to a finite interval being necessary to avoid densities
with arbitrarily small values. In the present regression situation, a restriction
to a compact interval is unnecessary and we shall choose $\beta_1, \ldots, \beta_J$ to be a
sample from the standard normal distribution.

We need the regressors $z_1, z_2, \ldots, z_n$ to be sufficiently regularly distributed
in the interval $[0, 1]$. In view of the spatial separation property of the B-spline
functions, the precise condition can be expressed in terms of the covariance
matrix $\Sigma_n = (\int B_i B_j\,d\mathbb{P}_n^z)$, namely

(7.11)  $J^{-1}\|\beta\|^2 \lesssim \beta^T \Sigma_n \beta \lesssim J^{-1}\|\beta\|^2,$

where $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^J$.

Under condition (7.11), we have that for all $\beta_1, \beta_2 \in \mathbb{R}^J$,

(7.12)  $C\|\beta_1 - \beta_2\| \le \sqrt{J}\,\|f_{\beta_1} - f_{\beta_2}\|_n \le C'\|\beta_1 - \beta_2\|$

for some constants $C$ and $C'$. This enables us to perform all calculations in
terms of the Euclidean norms on the spline coefficients.

Theorem 12. Assume that the true regression function $f_0$ satisfies (7.10) for some
$\alpha \ge \frac12$, let (7.11) hold and let $\Pi_n$ be priors induced by a $N_J(0, I)$ distribution
on the spline coefficients. If $J = J_n \sim n^{1/(1+2\alpha)}$, then the posterior converges
at the minimax rate $n^{-\alpha/(1+2\alpha)}$ relative to $\|\cdot\|_n$.

Proof. We verify the conditions of Theorem 4. Let $f_{\beta_n}$ be the $L_2(\mathbb{P}_n^z)$-projection
of $f_0$ onto the $J$-dimensional space of splines $f_\beta = \beta^T B$. Then
$\|f_{\beta_n} - f_\beta\|_n \le \|f_0 - f_\beta\|_n$ for every $\beta \in \mathbb{R}^J$ and hence, by (7.12), for every
$\varepsilon > 0$, we have $\{\beta : \|f_\beta - f_0\|_n \le \varepsilon\} \subset \{\beta : \|\beta - \beta_n\| \le C'\sqrt{J}\varepsilon\}$. It follows that
the $\varepsilon$-covering numbers of the set $\{f_\beta : \|f_\beta - f_0\|_n \le \varepsilon\}$ for $\|\cdot\|_n$ are bounded
by the $C\sqrt{J}\varepsilon$-covering numbers of a Euclidean ball of radius $C'\sqrt{J}\varepsilon$, which
are of the order $D^J$ for some constant $D$. Thus, the entropy condition (3.2)
is satisfied, provided that $J \lesssim n\varepsilon_n^2$.

By the projection property, with $\beta_\infty$ as in (7.10),

(7.13)  $\|f_{\beta_n} - f_0\|_n \le \|f_{\beta_\infty} - f_0\|_n \le \|f_{\beta_\infty} - f_0\|_\infty \lesssim J^{-\alpha}.$

Combining this with (7.12) shows that there exists a constant $C''$ such that
for every $\varepsilon \gtrsim 2J^{-\alpha}$, $\{\beta : \|f_\beta - f_0\|_n \le \varepsilon\} \supset \{\beta : \|\beta - \beta_n\| \le C''\sqrt{J}\varepsilon\}$. Together
with the inclusion in the preceding paragraph and the definition of the prior,
this implies that

    $\frac{\Pi_n(f : \|f - f_0\|_n \le j\varepsilon)}{\Pi_n(f : \|f - f_0\|_n \le \varepsilon)} \le \frac{N_J(0, I)(\beta : \|\beta - \beta_n\| \le C'j\sqrt{J}\varepsilon)}{N_J(0, I)(\beta : \|\beta - \beta_n\| \le C''\sqrt{J}\varepsilon)} \le \frac{N_J(0, I)(\beta : \|\beta\| \le C'j\sqrt{J}\varepsilon)}{2^{-J/2}e^{-\|\beta_n\|^2} N_J(0, I)(\beta : \|\beta\| \le C''\sqrt{J}\varepsilon/\sqrt{2})}.$

In the last step, we use Anderson's lemma to see that the numerator increases
if we replace the centering $\beta_n$ by the origin, whereas to bound the
denominator below, we use the fact that

    $\frac{dN_J(\beta_n, I)}{dN_J(0, I/2)}(\beta) \ge (\sqrt{2})^{-J} e^{-\|\beta_n\|^2}.$

Here, by the triangle inequality, (7.12) and (7.13), we have that
$\|\beta_n\| \lesssim \sqrt{J}\,\|f_{\beta_n}\|_n \lesssim \sqrt{J}(J^{-\alpha} + \|f_0\|_\infty) \lesssim \sqrt{J}$. Furthermore, the two Gaussian
probabilities are left tail probabilities of the chi-square distribution with $J$ degrees
of freedom. The quotient can be evaluated as

    $\frac{\int_0^{(C')^2 j^2 J\varepsilon^2} x^{J/2-1} e^{-x/2}\,dx}{\int_0^{(C'')^2 J\varepsilon^2/2} x^{J/2-1} e^{-x/2}\,dx}.$

This is bounded above by $(Cj)^J$ for some constant $C$ if $\sqrt{J}\varepsilon$ remains bounded.
Hence, to satisfy (3.4), it again suffices that $n\varepsilon_n^2 \gtrsim J$.

We conclude the proof by choosing $J = J_n \sim n^{1/(1+2\alpha)}$. □

7.7.2. Orthonormal series priors. The arguments in the preceding sub- 
section use the special nature of the B-spline basis only through the approx- 
imation inequality (7.10) and the comparison of norms (7.12). Theorem 12 
thus extends to many other possible bases. One possibility is to use a se- 
quence of orthonormal bases with good approximation properties for a given 
class of regression functions $f_0$. Then (7.11) should be replaced by

(7.14)  $\|\beta_1 - \beta_2\| \lesssim \|f_{\beta_1} - f_{\beta_2}\|_n \lesssim \|\beta_1 - \beta_2\|.$

This is trivially true if the bases are orthonormal in $L_2(\mathbb{P}_n^z)$, but this requires
that the basis functions change with the design points $z_1, \ldots, z_n$. One
possible example is the discrete wavelet bases relative to the design points.
All arguments remain valid in this setting.

7.8. Binary regression. Let $X_1, \ldots, X_n$ be independent observations with
$P(X_i = 1) = 1 - P(X_i = 0) = F(\alpha + \beta z_i)$, where $z_i$ is a one-dimensional covariate,
$\alpha$ and $\beta$ are parameters and $F$ is a cumulative distribution function. Within
the parametric framework, logit regression, where $F(z) = (1 + e^{-z})^{-1}$, or
probit regression, where $F$ is the cumulative distribution function of the
standard normal distribution, are usually considered. Recently, there has
been interest in link functions of unknown functional form. The parameters
$(F, \alpha, \beta)$ are separately not identifiable, unless some suitable restrictions
on $F$ (such as given values of two quantiles of $F$) are imposed. For Bayesian
estimation of $(F, \alpha, \beta)$, one therefore needs to put a prior on $F$ that conforms
with the given restriction. However, in practice, one usually puts a
Dirichlet process or a similar prior on $F$ and, independently of this, a prior
on $(\alpha, \beta)$, and makes inference about, say, $z_0$, where $F(\alpha + \beta z_0) = 1/2$. Recently,
Amewou-Atisso et al. [1] showed that the resulting posterior is consistent.
In this section, we obtain the rate of convergence by an application
of Theorem 4.

Because we directly measure distances between the distributions generating the data, identifiability issues need not concern us. The model and the prior can thus be described in a simpler form. We assume that $X_1,X_2,\dots$ are independent Bernoulli variables, $X_i$ having success parameter $H(z_i)$ for an unknown, monotone link function $H$. As a prior on $H$, we use the Dirichlet process prior with base measure $\gamma((\cdot-\alpha)/\beta)$, for "hyperparameters" $(\alpha,\beta)$ distributed according to some given prior. This results in a mixture of Dirichlet process priors for $H$. Let the true value of $H$ be $H_0$, which is assumed to be continuous and nondecreasing.

In practice, $\gamma$ is often chosen to have support equal to the whole of $\mathbb{R}$ and $(\alpha,\beta)$ chosen to have support equal to $\mathbb{R}\times(0,\infty)$ so that the conditions on $\gamma$ and $(\alpha,\beta)$ described in the following theorem are satisfied.

Theorem 13. Assume that $z_1,z_2,\dots,z_n$ lie in an interval $[a,b]$ strictly within the support of the true link function $H_0$ so that $H_0(a-)>0$ and $H_0(b)<1$. Let $H$ be given the mixture of Dirichlet process priors described previously, with $\gamma$ and $(\alpha,\beta)$ having densities that are positive and continuous inside their supports. Assume that there exists a compact set $K$ inside the support of the prior for $(\alpha,\beta)$ such that whenever $(\alpha,\beta)\in K$, the support of the base measure $\gamma((\cdot-\alpha)/\beta)$ strictly contains the interval $[a,b]$. Then the posterior distribution of $H$ converges at the rate $n^{-1/3}(\log n)^{1/3}$ with respect to the distance $d_n$ given by (3.1).
Proof. Because the squared Hellinger distance between two Bernoulli distributions with success parameters $p$ and $q$ is equal to $(p^{1/2}-q^{1/2})^2+((1-p)^{1/2}-(1-q)^{1/2})^2$, we have

$$
d_n^2(H_1,H_2)\le\int|H_1^{1/2}-H_2^{1/2}|^2\,d\mathbb{P}_n+\int|(1-H_1)^{1/2}-(1-H_2)^{1/2}|^2\,d\mathbb{P}_n,
$$

where $\mathbb{P}_n$ is the empirical distribution of $z_1,z_2,\dots,z_n$. Both the classes $\{H^{1/2}: H \text{ is a c.d.f.}\}$ and $\{(1-H)^{1/2}: H \text{ is a c.d.f.}\}$ have $\varepsilon$-entropy bounded by a multiple of $\varepsilon^{-1}$, by Theorem 2.7.5 of [31]. Thus, any $\varepsilon_n\gtrsim n^{-1/3}$ satisfies (3.2).
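As a numerical illustration of the distance used in this step, the following minimal sketch computes the squared Hellinger distance between two Bernoulli laws and averages it over the design points. The function names, the logistic link and the design values are assumptions chosen for the example and do not come from the paper.

```python
import math


def bernoulli_hellinger_sq(p: float, q: float) -> float:
    """Squared Hellinger distance between Bernoulli(p) and Bernoulli(q)."""
    return (math.sqrt(p) - math.sqrt(q)) ** 2 + (
        math.sqrt(1.0 - p) - math.sqrt(1.0 - q)
    ) ** 2


def d_n_squared(H1, H2, design) -> float:
    """Average squared Hellinger distance over the design points z_1..z_n."""
    return sum(bernoulli_hellinger_sq(H1(z), H2(z)) for z in design) / len(design)


def logistic(z: float) -> float:
    """Logistic link, used here only as an illustrative choice of H."""
    return 1.0 / (1.0 + math.exp(-z))


def shifted_logistic(z: float) -> float:
    """A second, hypothetical link function for comparison."""
    return logistic(z - 0.5)


design = [0.1 * i for i in range(11)]  # hypothetical design points in [0, 1]
print(d_n_squared(logistic, shifted_logistic, design))
```

The average is always between 0 and 2, with 0 attained when the two link functions agree at every design point.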

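By easy calculations, we have

$$
K_i(H_0,H)=H_0(z_i)\log\frac{H_0(z_i)}{H(z_i)}+(1-H_0(z_i))\log\frac{1-H_0(z_i)}{1-H(z_i)},
$$

$$
V_{2;i}(H_0,H)\le 2H_0(z_i)\Bigl(\log\frac{H_0(z_i)}{H(z_i)}\Bigr)^2+2(1-H_0(z_i))\Bigl(\log\frac{1-H_0(z_i)}{1-H(z_i)}\Bigr)^2.
$$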

Under the conditions of the theorem, the numbers $H_0(z_i)$ are bounded away from 0 and 1. By Taylor's expansion, for any $\delta>0$, there exists a constant $C$ (depending on $\delta$) such that

$$
\sup_{\delta<p<1-\delta}\;\sup_{q:|q-p|<\varepsilon}\Bigl(p\Bigl(\log\frac{p}{q}\Bigr)^r+(1-p)\Bigl(\log\frac{1-p}{1-q}\Bigr)^r\Bigr)\le C\varepsilon^2,\qquad r=1,2.
$$
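The quadratic bound can be examined numerically. The sketch below evaluates the two Bernoulli quantities (the Kullback-Leibler divergence for r = 1 and the second moment of the log-likelihood ratio for r = 2) on a grid and reports the largest ratio to the squared radius. The grid size, the value 0.1 for the lower bound on p and the function names are assumptions made for illustration, and a grid search only approximates the supremum.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Kullback-Leibler divergence of Bernoulli(q) from Bernoulli(p) (r = 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def second_moment_bernoulli(p: float, q: float) -> float:
    """Second moment of the log-likelihood ratio under Bernoulli(p) (r = 2)."""
    return p * math.log(p / q) ** 2 + (1 - p) * math.log((1 - p) / (1 - q)) ** 2


def sup_ratio(delta: float, eps: float, r: int, grid: int = 200) -> float:
    """Grid approximation of the supremum divided by eps**2.

    Requires eps < delta so that q stays strictly inside (0, 1).
    """
    f = kl_bernoulli if r == 1 else second_moment_bernoulli
    worst = 0.0
    for a in range(grid + 1):
        p = delta + (1 - 2 * delta) * a / grid
        for b in range(grid + 1):
            q = p - eps + 2 * eps * b / grid
            worst = max(worst, f(p, q) / eps ** 2)
    return worst


for eps in (0.05, 0.01, 0.001):
    print(eps, sup_ratio(0.1, eps, 1), sup_ratio(0.1, eps, 2))
```

The printed ratios stay bounded as the radius shrinks, which is what the Taylor expansion argument predicts for p bounded away from 0 and 1.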

Therefore, with $\|H-H_0\|_\infty=\sup\{|H(z)-H_0(z)|: z\in[a,b]\}$, we have $\max\bigl(n^{-1}\sum_{i=1}^nK_i(H_0,H),\;n^{-1}\sum_{i=1}^nV_{2;i}(H_0,H)\bigr)\lesssim\|H-H_0\|_\infty^2$. Hence, in order to satisfy (3.4), it suffices to lower bound the prior probability of the set

$$
\{H:\|H-H_0\|_\infty\le\varepsilon\}.
$$

For given $\alpha$ and $\beta$, the base measure is $\gamma((\cdot-\alpha)/\beta)$. For a given $\varepsilon>0$, partition the line into $N\lesssim\varepsilon^{-1}$ intervals $E_1,E_2,\dots,E_N$ such that $H_0(E_j)\le\varepsilon$ and such that the $\gamma((\cdot-\alpha)/\beta)$-probability of every set $E_j$ (for $j=1,2,\dots,N$) is between $A\varepsilon$ and 1 for a given positive constant $A$. Existence of such a partition follows from the continuity of $H_0$. It easily follows that for every $H$ such that $\sum_{j=1}^N|H(E_j)-H_0(E_j)|\le\varepsilon$, we have $\|H-H_0\|_\infty\lesssim\varepsilon$. Furthermore, the conclusion is true even if $(\alpha,\beta)$ varies over $K$. By Lemma 6.1 of [14], the prior probability of the set of all $H$ satisfying $\sum_{j=1}^N|H(E_j)-H_0(E_j)|\le\varepsilon$ is at least $\exp(-c\varepsilon^{-1}\log\varepsilon^{-1})$ for some constant $c$. Furthermore, a uniform estimate works for all $(\alpha,\beta)\in K$. Hence, (3.4) holds for $\varepsilon_n$, the solution of $n\varepsilon^2=\varepsilon^{-1}\log\varepsilon^{-1}$, or for $\varepsilon_n=n^{-1/3}(\log n)^{1/3}$, which is only slightly weaker than the minimax rate $n^{-1/3}$. □
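The rate equation in the last step can be solved numerically to see how its solution compares with the stated rate. The following is a small sketch using bisection; the function names and the sample sizes are hypothetical choices for the illustration.

```python
import math


def solve_rate(n: int) -> float:
    """Solve n * eps**2 = (1/eps) * log(1/eps) for eps in (0, 1) by bisection.

    Multiplying by eps, the equation reads n * eps**3 = log(1/eps); the
    difference g below is increasing in eps, so bisection applies.
    """
    def g(eps: float) -> float:
        return n * eps ** 3 - math.log(1.0 / eps)

    lo, hi = 1e-12, 1.0  # g(lo) < 0 < g(hi) for the sample sizes used here
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def benchmark_rate(n: int) -> float:
    """The rate n^{-1/3} (log n)^{1/3} stated in Theorem 13."""
    return (math.log(n) / n) ** (1.0 / 3.0)


for n in (10 ** 3, 10 ** 4, 10 ** 6):
    eps = solve_rate(n)
    print(n, eps, eps / benchmark_rate(n))
```

The exact solution is a constant multiple of the stated rate, so the two agree up to a bounded factor, which is all that a rate statement requires.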
7.9. Interval censoring. Let $T_1,T_2,\dots,T_n$ constitute an i.i.d. sample from a life distribution $F$ on $(0,\infty)$, which is subject to interval censoring by intervals $(l_1,u_1),\dots,(l_n,u_n)$. We assume that the intervals are either nonstochastic or else we work conditionally on the realized values. Putting $(\delta_1,\eta_1),\dots,(\delta_n,\eta_n)$, where $\delta_i=1\{T_i\le l_i\}$ and $\eta_i=1\{l_i<T_i<u_i\}$, $i=1,2,\dots,n$, the likelihood is given by $\prod_{i=1}^n(F(l_i))^{\delta_i}(F(u_i)-F(l_i))^{\eta_i}(1-F(u_i))^{1-\delta_i-\eta_i}$. We may put the Dirichlet process prior on $F$. Under mild assumptions on the true $F_0$ and the base measure, the convergence rate under $d_n$ turns out to be $n^{-1/3}(\log n)^{1/3}$, which is the minimax rate, except for the logarithmic factor. Here, we use monotonicity of $F$ to bound the $\varepsilon$-entropy by a multiple of $\varepsilon^{-1}$ and we estimate prior probability concentration as $\exp(-c\varepsilon^{-1}\log\varepsilon^{-1})$ using methods similar to those used in the previous subsection. The details are omitted.
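To make the interval-censoring likelihood concrete, the sketch below builds the censoring indicators from (normally unobserved) lifetimes and evaluates the log-likelihood for a candidate distribution function. The exponential distribution, the interval endpoints and all function names are assumptions chosen for the example.

```python
import math


def censoring_indicators(times, lowers, uppers):
    """Return (deltas, etas): delta_i = 1{T_i <= l_i}, eta_i = 1{l_i < T_i < u_i}."""
    deltas = [1 if t <= l else 0 for t, l in zip(times, lowers)]
    etas = [1 if l < t < u else 0 for t, l, u in zip(times, lowers, uppers)]
    return deltas, etas


def log_likelihood(F, lowers, uppers, deltas, etas) -> float:
    """Log of prod F(l)^delta * (F(u) - F(l))^eta * (1 - F(u))^(1 - delta - eta)."""
    total = 0.0
    for l, u, d, e in zip(lowers, uppers, deltas, etas):
        if d == 1:            # event occurred before the interval
            total += math.log(F(l))
        elif e == 1:          # event occurred inside the interval
            total += math.log(F(u) - F(l))
        else:                 # event had not occurred by the end of the interval
            total += math.log(1.0 - F(u))
    return total


def exp_cdf(t: float) -> float:
    """Standard exponential c.d.f., an illustrative choice of F."""
    return 1.0 - math.exp(-t)


times = [0.5, 1.5, 3.0]          # hypothetical lifetimes
lowers = [1.0, 1.0, 1.0]         # hypothetical left endpoints l_i
uppers = [2.0, 2.0, 2.0]         # hypothetical right endpoints u_i
deltas, etas = censoring_indicators(times, lowers, uppers)
print(deltas, etas, log_likelihood(exp_cdf, lowers, uppers, deltas, etas))
```

Only the indicators and the interval endpoints enter the likelihood, which is why the lifetimes themselves need not be observed.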

8. Proofs. In this section, we collect a number of technical proofs. For 
the proofs of the main results, we first present two lemmas. 

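Lemma 9. Let $d_n$ and $e_n$ be semimetrics on $\Theta$ for which tests satisfying the conditions of (2.2) exist. Suppose that for some nonincreasing function $\varepsilon\mapsto N(\varepsilon)$ and some $\varepsilon_n\ge0$,

$$
N\Bigl(\frac{\varepsilon\xi}{2},\{\theta\in\Theta: d_n(\theta,\theta_0)\le\varepsilon\},e_n\Bigr)\le N(\varepsilon)\qquad\text{for all }\varepsilon>\varepsilon_n. \tag{8.1}
$$

Then for every $\varepsilon>\varepsilon_n$, there exist tests $\phi_n$, $n\ge1$, (depending on $\varepsilon$) such that $P_{\theta_0}^{(n)}\phi_n\le N(\varepsilon)\dfrac{e^{-Kn\varepsilon^2}}{1-e^{-Kn\varepsilon^2}}$ and $P_\theta^{(n)}(1-\phi_n)\le e^{-Kn\varepsilon^2j^2}$ for all $\theta\in\Theta$ such that $d_n(\theta,\theta_0)>j\varepsilon$ and for every $j\in\mathbb{N}$.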



Proof. For a given $j\in\mathbb{N}$, choose a maximal set of points in $\Theta_j=\{\theta\in\Theta: j\varepsilon<d_n(\theta,\theta_0)\le(j+1)\varepsilon\}$ with the property that $e_n(\theta,\theta')>j\varepsilon\xi$ for every pair of points in the set. Because this set of points is a $j\varepsilon\xi$-net over $\Theta_j$ for $e_n$ and because $(j+1)\varepsilon\le2j\varepsilon$, this yields a set $\Theta_j'$ of at most $N(2j\varepsilon)$ points, each at $d_n$-distance at least $j\varepsilon$ from $\theta_0$, and every $\theta\in\Theta_j$ is within $e_n$-distance $j\varepsilon\xi$ of at least one of these points. (If $\Theta_j$ is empty, we take $\Theta_j'$ to be empty also.) By assumption, for every point $\theta_1\in\Theta_j'$, there exists a test with the properties as in (2.2), but with $\varepsilon$ replaced by $j\varepsilon$. Let $\phi_n$ be the maximum of all tests attached in this way to some point $\theta_1\in\Theta_j'$ for some $j\in\mathbb{N}$. Then

$$
P_{\theta_0}^{(n)}\phi_n\le\sum_{j=1}^\infty\sum_{\theta_1\in\Theta_j'}e^{-Knj^2\varepsilon^2}\le\sum_{j=1}^\infty N(2j\varepsilon)e^{-Knj^2\varepsilon^2}\le N(\varepsilon)\frac{e^{-Kn\varepsilon^2}}{1-e^{-Kn\varepsilon^2}}
$$

and for every $j\in\mathbb{N}$,

$$
\sup_{\theta\in\bigcup_{i\ge j}\Theta_i}P_\theta^{(n)}(1-\phi_n)\le\sup_{i\ge j}e^{-Kni^2\varepsilon^2}\le e^{-Knj^2\varepsilon^2},
$$

where we have used the fact that for every $\theta\in\Theta_i$, there exists a test $\phi$ with $\phi_n\ge\phi$ and $P_\theta^{(n)}(1-\phi)\le e^{-Kni^2\varepsilon^2}$. This concludes the proof. □
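The last inequality in the bound on the type I error uses that $N$ is nonincreasing and that $e^{-Knj^2\varepsilon^2}\le e^{-Knj\varepsilon^2}$, so the series is dominated by a geometric series. The sketch below checks this numerically; the values of $K$, $n$ and $\varepsilon$ are arbitrary illustrative assumptions, since the paper treats $K$ as an unspecified constant, and the series is truncated at a finite number of terms.

```python
import math


def type_one_error_terms(K: float, n: int, eps: float, terms: int = 2000):
    """Return (truncated series, geometric bound) for a = K * n * eps**2.

    series = sum_{j=1}^{terms} exp(-a * j**2)
    bound  = exp(-a) / (1 - exp(-a)) = sum_{j>=1} exp(-a * j)
    """
    a = K * n * eps ** 2
    series = sum(math.exp(-a * j * j) for j in range(1, terms + 1))
    bound = math.exp(-a) / (1.0 - math.exp(-a))
    return series, bound


for eps in (0.01, 0.1, 0.3):
    series, bound = type_one_error_terms(K=1.0, n=100, eps=eps)
    print(eps, series, bound)
```

Multiplying both quantities by $N(\varepsilon)$ gives the two sides of the final inequality in the display.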

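Lemma 10. For $k\ge2$, every $\varepsilon>0$ and every probability measure $\bar\Pi_n$ supported on the set $B_n(\theta_0,\varepsilon;k)$, we have, for every $C>0$,

$$
P_{\theta_0}^{(n)}\Bigl(\int\frac{p_\theta^{(n)}}{p_{\theta_0}^{(n)}}\,d\bar\Pi_n(\theta)\le e^{-(1+C)n\varepsilon^2}\Bigr)\le\frac{1}{C^k(n\varepsilon^2)^{k/2}}. \tag{8.2}
$$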

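Proof. By Jensen's inequality applied to the logarithm, with $l_{n,\theta}=\log(p_\theta^{(n)}/p_{\theta_0}^{(n)})$, we have $\log\int(p_\theta^{(n)}/p_{\theta_0}^{(n)})\,d\bar\Pi_n(\theta)\ge\int l_{n,\theta}\,d\bar\Pi_n(\theta)$. Thus, the probability in (8.2) is bounded above by

$$
P_{\theta_0}^{(n)}\Bigl(\int(l_{n,\theta}-P_{\theta_0}^{(n)}l_{n,\theta})\,d\bar\Pi_n(\theta)\le-n(1+C)\varepsilon^2-\int P_{\theta_0}^{(n)}l_{n,\theta}\,d\bar\Pi_n(\theta)\Bigr). \tag{8.3}
$$

For every $\theta\in B_n(\theta_0,\varepsilon;k)$, we have $P_{\theta_0}^{(n)}l_{n,\theta}=-K(p_{\theta_0}^{(n)},p_\theta^{(n)})\ge-n\varepsilon^2$. Consequently, by Fubini's theorem and the assumption that $\bar\Pi_n$ is supported on this set, the expression on the right-hand side of (8.3) is bounded above by $-Cn\varepsilon^2$. An application of Markov's inequality yields the upper bound

$$
\frac{P_{\theta_0}^{(n)}\bigl|\int(l_{n,\theta}-P_{\theta_0}^{(n)}l_{n,\theta})\,d\bar\Pi_n(\theta)\wedge0\bigr|^k}{(Cn\varepsilon^2)^k}\le\frac{P_{\theta_0}^{(n)}\int|l_{n,\theta}-P_{\theta_0}^{(n)}l_{n,\theta}|^k\,d\bar\Pi_n(\theta)}{(Cn\varepsilon^2)^k}
$$

by another application of Jensen's inequality. The right-hand side is bounded by $C^{-k}(n\varepsilon^2)^{-k/2}$, by the assumption on $\bar\Pi_n$. This concludes the proof. □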

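Proof of Theorem 1. By Lemma 9, applied with $N(\varepsilon)=\exp(n\varepsilon_n^2)$ (constant in $\varepsilon$) and $\varepsilon=M\varepsilon_n$ in its assertion, where $M>2$ is a large constant to be chosen later, there exist tests $\phi_n$ that satisfy $P_{\theta_0}^{(n)}\phi_n\le e^{n\varepsilon_n^2}(1-e^{-KnM^2\varepsilon_n^2})^{-1}e^{-KnM^2\varepsilon_n^2}$ and $P_\theta^{(n)}(1-\phi_n)\le e^{-KnM^2\varepsilon_n^2j^2}$ for all $\theta\in\Theta_n$ such that $d_n(\theta,\theta_0)>M\varepsilon_nj$ and for every $j\in\mathbb{N}$. The first assertion implies that if $M$ is sufficiently large to ensure that $KM^2-1>KM^2/2$, then as $n\to\infty$, for any $J\ge1$, we have

$$
P_{\theta_0}^{(n)}\bigl[\Pi_n(d_n(\theta,\theta_0)>JM\varepsilon_n\,|\,X^{(n)})\,\phi_n\bigr]\le P_{\theta_0}^{(n)}\phi_n\lesssim e^{-KM^2n\varepsilon_n^2/2}. \tag{8.4}
$$

Setting $\Theta_{n,j}=\{\theta\in\Theta_n: M\varepsilon_nj<d_n(\theta,\theta_0)\le M\varepsilon_n(j+1)\}$ and using (2.2), we obtain, by Fubini's theorem,

$$
P_{\theta_0}^{(n)}\Bigl[\int_{\Theta_{n,j}}\frac{p_\theta^{(n)}}{p_{\theta_0}^{(n)}}\,d\Pi_n(\theta)\,(1-\phi_n)\Bigr]\le e^{-KnM^2\varepsilon_n^2j^2}\,\Pi_n(\Theta_{n,j}). \tag{8.5}
$$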
Fix some $C>0$. By Lemma 10, we have, on an event $A_n$ with probability at least $1-C^{-k}(n\varepsilon_n^2)^{-k/2}$,

$$
\int\frac{p_\theta^{(n)}}{p_{\theta_0}^{(n)}}\,d\Pi_n(\theta)\ge\int_{B_n(\theta_0,\varepsilon_n;k)}\frac{p_\theta^{(n)}}{p_{\theta_0}^{(n)}}\,d\Pi_n(\theta)\ge e^{-(1+C)n\varepsilon_n^2}\,\Pi_n(B_n(\theta_0,\varepsilon_n;k)).
$$

Hence, decomposing $\{\theta\in\Theta_n: d_n(\theta,\theta_0)>JM\varepsilon_n\}=\bigcup_{j\ge J}\Theta_{n,j}$ and using (8.5), the last display and assumption (2.5), we have, for every sufficiently large $J$,

$$
P_{\theta_0}^{(n)}\bigl[\Pi_n(\theta\in\Theta_n: d_n(\theta,\theta_0)>J\varepsilon_nM\,|\,X^{(n)})(1-\phi_n)1_{A_n}\bigr]\le\sum_{j\ge J}e^{-n\varepsilon_n^2(KM^2j^2-1-C-\frac12KM^2j^2)}.
$$

This converges to zero as $n\to\infty$ for fixed $C$ and fixed, sufficiently large $M$ and $J$ if $n\varepsilon_n^2\to\infty$; it converges to zero for fixed $M$ and $C$ as $J=J_n\to\infty$ if $n\varepsilon_n^2$ is bounded away from zero.

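Combining the preceding results, we have, for sufficiently large $M$ and $J$,

$$
P_{\theta_0}^{(n)}\Pi_n(\theta\in\Theta_n: d_n(\theta,\theta_0)>M\varepsilon_nJ\,|\,X^{(n)})\le\frac{1}{C^k(n\varepsilon_n^2)^{k/2}}+2e^{-KM^2n\varepsilon_n^2/2}+\sum_{j=J}^\infty e^{-n\varepsilon_n^2(\frac12KM^2j^2-1-C)}. \tag{8.6}
$$

The rest of the conclusion follows easily; see the proof of Theorem 2.4 of [14]. □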

Proof of Theorem 2. If $\varepsilon_n\gtrsim n^{-\alpha}$ and $k(1-2\alpha)>2$ for $\alpha\in(0,1/2)$, then $n\varepsilon_n^2\to\infty$ and $\sum_{n=1}^\infty(n\varepsilon_n^2)^{-k/2}<\infty$. For $C=1/2$, the first term on the right-hand side of (8.6) dominates and the sum over $n$ of the terms in (8.6) converges. The result (i) follows by the Borel-Cantelli lemma.

For assertion (ii), we note that $\varepsilon_n\gtrsim n^{-\alpha}$ and $k(1-2\alpha)>4\alpha$ together imply that $(n\varepsilon_n^2)^{-k/2}\lesssim\varepsilon_n^2$. The other terms are exponentially small. □

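Proof of Lemma 1. Because $P_{\theta_0}^{(n)}(p_\theta^{(n)}/p_{\theta_0}^{(n)})\le1$, Fubini's theorem implies that $P_{\theta_0}^{(n)}\bigl[\int_{\Theta\setminus\Theta_n}(p_\theta^{(n)}/p_{\theta_0}^{(n)})\,d\Pi_n(\theta)\bigr]\le\Pi_n(\Theta\setminus\Theta_n)$. Let the events $A_n$ be as in the proof of Theorem 1, so that the denominator of the posterior is bounded below by $e^{-(1+C)n\varepsilon_n^2}\,\Pi_n(B_n(\theta_0,\varepsilon_n;k))$ on $A_n$. Combining this with the preceding display gives

$$
P_{\theta_0}^{(n)}\bigl[\Pi_n(\theta\notin\Theta_n\,|\,X^{(n)})1_{A_n}\bigr]\le\frac{\Pi_n(\Theta\setminus\Theta_n)\,e^{(1+C)n\varepsilon_n^2}}{\Pi_n(B_n(\theta_0,\varepsilon_n;k))}\le o(1)\,e^{-n\varepsilon_n^2(1-C)}
$$

by the assumption on $\Pi_n(\Theta\setminus\Theta_n)$. The rest of the proof can be completed along the lines of that of Theorem 2.4 of [14]. □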